Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 11 2003 - 11:32:55 EST

    On 11/12/2003 07:40, Mark Davis wrote:

    >Peter, here is your original remark. Ken has gracefully filled the gap in
    >explaining the higher-level issues, but let's return to that for a minute.
    >
    >
    >
    >>>No, surely not. If the wcslen() function is fully Unicode conformant, it
    >>>should give the same output whatever the canonically equivalent form of
    >>>its input. That more or less implies that it should normalise its input.
    >>>
    >>>
    >
    >Talking about looking at the problem "at levels" really obscures the issues.
    >Programmers call functions. Those functions don't magically change when one
    >achieves a new Level of Enlightenment.
    >
    >
    Mark, don't patronise me. I'm not talking about levels of enlightenment.
    I'm not talking about levels in the sense you just used when you
    mentioned "higher-level issues". I'm talking about the well-known
    concept of levels or layers of programming and of communication protocols.

    >The function wcslen is defined as "Determines the number of characters in a
    >wide-character string." In C, those are not even defined to be Unicode
    >characters. IF Unicode is used, wide-characters (wchar_t) may be codepoints or
    >code units, depending on the implementation. The function is not defined -- and
    >could never be redefined, without huge breakage -- to return the number of NFC
    >codepoints.
    >
    >
    >
    I understand that. There is no problem if wcslen is used for the
    function it is defined for, in terms of counting storage locations or
    units in one of the UTFs. The problem arose when someone else on the
    list suggested that this same function could be used to count Unicode
    characters. At the time I suggested that this should not be done and
    would have problems with Unicode conformance, and that the only
    meaningful counting that should be done was of something like default
    grapheme clusters. Ken has since convinced me that it is sensible and
    conformant to count the number of Unicode code units or code points in a
    string as long as one is working with the string as an entity to be
    manipulated programmatically. But this should not be done when one is
    dealing with any kind of "interpretation" as that must respect canonical
    equivalence.
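
    A minimal sketch of that distinction, assuming Python 3 and its
    standard unicodedata module (neither appears in this thread; the
    helper names code_point_count and canonically_equal are made up for
    illustration). Counting code points treats the string as an entity,
    so two canonically equivalent spellings give different counts, while
    a comparison that claims to interpret the text normalises first:

```python
import unicodedata

# Two canonically equivalent spellings of e-acute (illustrative values).
precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def code_point_count(s: str) -> int:
    """Count code points; a Python 3 str is a sequence of code points."""
    return len(s)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(code_point_count(precomposed))               # 1
    print(code_point_count(decomposed))                # 2
    print(precomposed == decomposed)                   # False
    print(canonically_equal(precomposed, decomposed))  # True
```

    Normalising to NFC is only one possible choice here; NFD would serve
    equally well for the comparison, provided both sides use the same
    form.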

    >Part of the problem is that "character" can be interpreted in a wide variety of
    >ways, which is why we were forced into developing more precise terms like code
    >units. So in general:
    >
    >1. If you want a function that returns the number of code units in X, you need
    >to call one that is defined to do so.
    >2. If you want a function that returns the number of code points in X, you need
    >to call one that is defined to do so.
    >3. If you want a function that returns the number of code points in toNFC(x),
    >you need to call one that is defined to do so.
    >4. If you want a function that returns the number of grapheme clusters in X, you
    >need to call one that is defined to do so.
    >5. If you want a function that returns the number of glyphs in X using font F
    >and parameters P, you need to call one that is defined to do so.
    >- And so on.
    >
    >There is a pattern here.
    >
    >
    >
    Of course. The original problem was that someone was trying to use a
    function defined for one thing to do something different.
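
    The first three items in Mark's list might be sketched as follows,
    again assuming Python 3 and its standard unicodedata module. The
    function names are hypothetical, chosen only so that each one says
    what it counts. Items 4 and 5 are left out because grapheme cluster
    segmentation and glyph counting need more than the standard library
    offers.

```python
import unicodedata


def utf8_code_units(s: str) -> int:
    """Number of 8-bit code units in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of s (no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Number of code points in s as given."""
    return len(s)


def nfc_code_points(s: str) -> int:
    """Number of code points in toNFC(s)."""
    return len(unicodedata.normalize("NFC", s))


# "e" + COMBINING ACUTE ACCENT + U+1D11E MUSICAL SYMBOL G CLEF
sample = "e\u0301\U0001D11E"

if __name__ == "__main__":
    print(utf8_code_units(sample))   # 7  (1 + 2 + 4 bytes)
    print(utf16_code_units(sample))  # 4  (1 + 1 + a surrogate pair)
    print(code_points(sample))       # 3
    print(nfc_code_points(sample))   # 2  (e + accent compose to U+00E9)
```

    Four functions give four different answers for the same string,
    which is why the question "how long is it?" has no meaning until one
    says which of these is being counted.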

    >Of course in reality, there might not be individual functions for these. The
    >most commonly used of these functions will always be #1, no matter what one's
    >Level of Enlightenment is. That's because people typically need to know how much
    >actual storage a string takes. ...
    >
    Here I disagree. As an application programmer writing, for example,
    some kind of linguistic application, it is totally irrelevant to me how much
    actual storage a string takes. Such things should be hidden away from me
    by several levels of system software and compilers. An application
    programmer doesn't even need to know what this concept means! Seriously!
    Beginners, even young children, can be taught simple programming and
    string handling without knowing anything about bits and bytes, certainly
    without having to know whether the e acute they just typed is stored as
    one byte or two. Just as people can and do learn to drive cars without
    knowing anything about the nuts and bolts or how the engine works.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

