Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Dec 11 2003 - 13:16:15 EST

    > Mark, don't patronise me. I'm not talking about levels of enlightenment.
    > I'm not talking about levels in the sense you just used when you
    > mentioned "higher-level issues". I'm talking about the well-known
    > concept of levels or layers of programming and of communication protocols.

    My apologies; I had intended a lighter tone, not patronization.

    > Here I disagree. As an application programmer writing for example some
    > kind of linguistic application, it is totally irrelevant to me how much
    > actual storage a string takes. Such things should be hidden away from me
    > by several levels of system software and compilers. An application
    > programmer doesn't even need to know what this concept means! Seriously!
    > Beginners, even young children, can be taught simple programming and
    > string handling without knowing anything about bits and bytes, certainly
    > without having to know whether the e acute they just typed is stored as
    > one byte or two. Just as people can and do learn to drive cars without
    > knowing anything about the nuts and bolts or how the engine works.

    A nice dream, but it doesn't really match anything I know about. Programmers will
    always need to know storage counts in strings, at least in intermediate
    processing. In C, of course, it is crucial. Even in a language with String
    objects, like Java, just getting the tail end of a string uses a length.

    a.substring(pos, a.length())
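    As a rough illustration of what that length measures: in Java,
    String.length() counts UTF-16 code units, so the same e acute can have
    length 1 or 2 depending on how it is stored. The class and method names
    below are made up for this sketch; only the String methods are standard.

```java
public class TailOfString {
    // Hypothetical helper: the part of s from code unit offset pos to the end.
    static String tail(String s, int pos) {
        return s.substring(pos, s.length());
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // e acute as one code point, one code unit
        String decomposed = "e\u0301";  // e + combining acute, two code units

        // length() reports storage (UTF-16 code units), not perceived characters.
        System.out.println(precomposed.length());
        System.out.println(decomposed.length());

        // The end offset passed to substring is that same code unit count.
        System.out.println(tail("abcdef", 2));
    }
}
```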

    Indexing within strings is always done in storage units, for good reason. Take
    a typical operation: I do a match on a string, and find out that the position of
    what I was searching for was <9,15> (in code units). I then do some other
    operations using that data, e.g. extracting a substring, or replacing the
    contents. Those all reference the indexes that I determined earlier. All of
    these processes are much faster if the indexing is always done in code units.
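    The match-then-reuse pattern described above might look like the following
    in Java, where java.util.regex reports match positions as UTF-16 code unit
    offsets. This is only a sketch: the class name, helper names, and sample
    text are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeUnitIndexing {
    // Returns {start, end} of the first match, in UTF-16 code units, or null.
    static int[] find(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    // Replaces text[start, end) using the same code unit offsets.
    static String replaceRange(String text, int start, int end, String replacement) {
        return new StringBuilder(text).replace(start, end, replacement).toString();
    }

    public static void main(String[] args) {
        String text = "The caf\u00e9 is open";
        int[] span = find(text, "caf\u00e9");

        // The offsets found once are reused directly by later operations.
        System.out.println(span[0] + "," + span[1]);
        System.out.println(text.substring(span[0], span[1]));
        System.out.println(replaceRange(text, span[0], span[1], "bar"));
    }
}
```

    Because every step shares one unit of indexing, no step has to rescan the
    string to translate a position.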

    You are right that higher-level tools make it less necessary to get into some of
    the guts here. Rather than have to deal with indexes, I can use a split function
    to produce an array of strings, or a regex function to search and replace. But I
    don't see how you can always get away from the need to index.
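    For comparison, a small sketch of the index-free style using Java's
    standard String.split and String.replaceAll. The class and method names are
    hypothetical; the caller never sees an offset, although the library still
    works with offsets internally.

```java
import java.util.Arrays;

public class HigherLevelTools {
    // Splits on runs of whitespace; no offsets are visible to the caller.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // Search and replace through a regex, again without explicit indexes.
    static String collapseSpaces(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("one two   three")));
        System.out.println(collapseSpaces("a   b\tc"));
    }
}
```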

    One could, of course, design a programming language that always indexed and
    counted by some other entity, say, default grapheme clusters. Such a language
    would be unable to deal with pieces that didn't constitute a complete
    cluster, and would have to deal with issues such as the fact that the number of
    entities in the concatenation of two strings is not always the same as the sum
    of the numbers of entities in each of the strings, so indexing gets pretty
    tricky. I don't know of any programming language that has tried to do this, and
    I don't think it would be of particular value -- and it would be exceedingly slow.
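    The concatenation problem can be seen in a small Java sketch. It uses
    java.text.BreakIterator's character instance as a stand-in for default
    grapheme clusters; that is an assumption of this example, since the JDK's
    boundaries are not guaranteed to match the Unicode default rules in every
    version. The class and method names are invented.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class ClusterCount {
    // Counts user-perceived characters as segmented by BreakIterator
    // (used here as an approximation of default grapheme clusters).
    static int clusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String base = "e";
        String mark = "\u0301";  // combining acute accent on its own

        // Code unit counts add up under concatenation.
        System.out.println(base.length() + mark.length() == (base + mark).length());

        // Cluster counts need not: each piece is one unit alone, and the
        // concatenation is expected to merge into a single unit.
        System.out.println(clusters(base) + " " + clusters(mark) + " " + clusters(base + mark));
    }
}
```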

    To take your analogy of the car, programmers are really much more like the
    mechanics than the drivers. A casual driver doesn't really need to know
    anything, although he or she will still need to know some measurement of gas.
    (Maybe that isn't true of SUV owners -- they'd really rather not know the
    cumulative effects on their pocketbook, the environment, or international
    politics.) But the mechanics still have to know how to measure physical things.
    As their diagnostic computers get better, their tools help to alleviate a lot
    of the work they used to do by hand, but they still need to be able to fasten a
    bolt with a certain measure of torque.

    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >