Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Dec 10 2003 - 07:25:10 EST


    On 10/12/2003 02:41, jon@hackcraft.net wrote:

    >Quoting Peter Kirk <peterkirk@qaya.org>:
    >
    >
    >
    >>OK, as a C function handling wchar_t arrays it is not expected to
    >>conform to Unicode. But if it is presented as a function available to
    >>users for handling Unicode text, for determining how many characters (as
    >>defined by Unicode) are in a string, it should conform to Unicode,
    >>including C9.
    >>
    >>
    >
    >If a function is presented as a function available to users for handling
    >Unicode text then it should do whatever it claims to do.
    >
    >
    That's not what the standard says. According to C7:

    > C7 A process shall interpret a coded character representation
    > according to the character semantics established by this standard, if
    > that process does interpret that coded character representation.
    > • This restriction does not preclude internal transformations that are
    > never visible external to the process.

    So, "If a function is presented as a function available to users for
    handling Unicode text", it has to do so in accordance with the standard,
    and is not free to do something else even if it openly claims to do that
    something else. (I understand "users" here as separate processes;
    Unicode conformance does not restrict internal functions.) And there is
    a clear intention that processes ought to treat all canonically
    equivalent strings identically, although there is a get-out clause
    allowing non-ideal implementations not to do so.

    A process is permitted to offer a function which distinguishes between
    canonically equivalent forms, but, by C9, no other process is permitted
    to rely on this distinction. This seems paradoxical but is actually
    rather sensible. Such a distinction should only be made as an accidental
    feature of a non-ideal version of a function, perhaps one which makes no
    claim to support the whole of Unicode, and ideally such a function
    should be replaced over time by an upgraded version which supports the
    whole of Unicode and makes no distinction between canonically equivalent
    forms.
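
    To make the C9 point concrete, here is a rough sketch (using ICU's C
    API purely for illustration; nothing here is mandated by the standard)
    of two canonically equivalent buffers which differ unit-for-unit but
    compare equal once normalised:

    #include <stdio.h>
    #include <unicode/unorm2.h>
    #include <unicode/ustring.h>

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        const UChar precomposed[] = { 0x00E9, 0 };         /* U+00E9 e-acute */
        const UChar decomposed[]  = { 0x0065, 0x0301, 0 }; /* e + combining acute */

        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
        UChar a[8], b[8];
        int32_t lenA = unorm2_normalize(nfc, precomposed, -1, a, 8, &status);
        int32_t lenB = unorm2_normalize(nfc, decomposed,  -1, b, 8, &status);

        /* The raw buffers differ, so a naive comparison distinguishes them... */
        printf("raw equal:        %d\n", u_strcmp(precomposed, decomposed) == 0);
        /* ...but by C9 no process should rely on that distinction: after
           normalisation they are the same Unicode string. */
        printf("normalised equal: %d\n", lenA == lenB && u_strcmp(a, b) == 0);
        return U_FAILURE(status) ? 1 : 0;
    }

    A process is permitted to expose the raw comparison; it is any process
    which then depends on the two buffers being different that falls foul
    of C9.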

    >There are concepts of "code units", "code points", "characters", and "default
    >grapheme clusters" in Unicode. Functions which count any of these are
    >perfectly conformant with Unicode, as long as they perform their task correctly.
    >
    >
    >
    I fully agree with you on "default grapheme clusters", a concept which
    is invariant under canonically equivalent transformations (that is
    right, isn't it?). These need to be counted by renderers, and perhaps in
    other circumstances; for example, this is probably the right thing to
    count when a character count is wanted as an estimate of the length of
    a text.
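
    For what it is worth, such a count can be sketched like this (my
    illustration, using ICU's character break iterator as one possible
    implementation; the function name is mine):

    #include <stdio.h>
    #include <unicode/ubrk.h>

    int32_t count_grapheme_clusters(const UChar *text, int32_t length) {
        UErrorCode status = U_ZERO_ERROR;
        UBreakIterator *bi =
            ubrk_open(UBRK_CHARACTER, "en", text, length, &status);
        if (U_FAILURE(status)) return -1;

        int32_t count = 0;
        while (ubrk_next(bi) != UBRK_DONE)
            count++;                 /* each boundary reached = one cluster */

        ubrk_close(bi);
        return count;
    }

    int main(void) {
        /* e + combining acute: two code points, one grapheme cluster */
        const UChar e_acute[] = { 0x0065, 0x0301 };
        printf("%d\n", count_grapheme_clusters(e_acute, 2));  /* prints 1 */
        return 0;
    }

    The answer is 1 whether the input is the precomposed or the decomposed
    form, which is exactly the invariance under canonical equivalence
    mentioned above.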

    As for counting "code units", "code points" and "characters", we need to
    distinguish different levels here. Of course it is necessary to count
    such things internally within an implementation of certain Unicode
    functions e.g. normalisation, and when allocating memory space. At this
    level we are talking about a data type consisting of bytes or words for
    one of the UTFs; we are not really talking about Unicode strings.
    Obviously the wcslen function as originally discussed is supposed to
    work at this level, and there is no problem with that. The problem comes
    when the function is reapplied as a count of the length of a Unicode
    string. For one thing, it is going to give the wrong answer unless it
    uses 32-bit (well, 21-bit or more) words, as it certainly shouldn't be
    hacked to recognise surrogates. But the other problem is that to use
    this function with Unicode strings is to confuse different data types.
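
    To spell out the distinction (a sketch of mine, not anything from the
    original wcslen discussion): the same UTF-16 buffer gives different
    answers depending on whether one counts 16-bit code units, as a
    wcslen-style function does, or code points:

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Counts 16-bit code units up to the terminating zero, like wcslen on
       a platform whose wchar_t is 16 bits wide. */
    static size_t count_code_units(const uint16_t *s) {
        size_t n = 0;
        while (s[n]) n++;
        return n;
    }

    /* Counts code points: a high surrogate followed by a low surrogate is
       one code point, not two. */
    static size_t count_code_points(const uint16_t *s) {
        size_t n = 0;
        for (size_t i = 0; s[i]; i++) {
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;                 /* skip the low half of the pair */
            n++;
        }
        return n;
    }

    int main(void) {
        /* U+1D11E MUSICAL SYMBOL G CLEF as a surrogate pair */
        const uint16_t s[] = { 0xD834, 0xDD1E, 0 };
        printf("code units:  %zu\n", count_code_units(s));   /* 2 */
        printf("code points: %zu\n", count_code_points(s));  /* 1 */
        return 0;
    }

    Neither number is invariant under canonical equivalence, which is part
    of why neither belongs in the higher-level interface discussed next.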

    I was implicitly thinking in terms of a higher level and more abstract
    data type of a Unicode string. That is the level of abstraction which
    should be offered to users i.e. other processes or application
    programmers, by, for example, a general purpose Unicode-compatible
    string handling and I/O library. Such a Unicode string data type should
    be independent of encoding form; the choice between UTF-8/16/32 etc
    should be left to the compiler. C9 implies that it should also "ideally"
    be independent of which canonically equivalent form of the text is
    stored, and this
    ideal can easily (though maybe not efficiently) be attained by
    automatically normalising all strings passed to and from the library.
    (Indeed one might even build into the data type definition an automatic
    normalisation process, used whenever a string is stored, but I will
    assume that this is not done.) Within such a context, a library function
    to determine whether a string is normalised is meaningless, and will
    always return TRUE; and this is completely conformant to C9.
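
    As a rough sketch of what I mean (names like ustr_new are hypothetical,
    and I am assuming ICU only as a convenient normaliser), such a library
    might normalise every string on the way in, so that no caller can ever
    observe a non-normalised string:

    #include <stdlib.h>
    #include <unicode/unorm2.h>

    typedef struct {
        UChar  *buf;   /* always held in NFC */
        int32_t len;
    } UStr;

    UStr *ustr_new(const UChar *src, int32_t srcLen) {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);

        /* First pass with a NULL buffer just measures the NFC length. */
        int32_t outLen = unorm2_normalize(nfc, src, srcLen, NULL, 0, &status);
        if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR) return NULL;

        UStr *s = malloc(sizeof *s);
        s->buf = malloc((size_t)outLen * sizeof(UChar));
        status = U_ZERO_ERROR;
        s->len = unorm2_normalize(nfc, src, srcLen, s->buf, outLen, &status);
        if (U_FAILURE(status)) { free(s->buf); free(s); return NULL; }
        return s;
    }

    With that construction, an "is this UStr normalised?" function really
    would always return TRUE, and exposing it would add nothing.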

    Within the functions associated with the data type, rather than as an
    external process or library function, there might be a place for a
    normalisation test function. On the other hand, at this level it is
    redundant, as the preferred thing to do with a non-normalised string is
    always to normalise it (or are there security-related cases where this
    does not apply?); and so if a string is required to be normalised, even
    if there is a good chance that it already is normalised, the correct
    thing to do is to normalise it again (and the normalisation function,
    operating at a lower level, may for efficiency first check normalisation
    before applying the full procedure).
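
    That lower-level optimisation might look something like this (a sketch
    assuming ICU's quick-check API; normalize_if_needed is my own name, and
    the caller is assumed to provide destCap >= srcLen):

    #include <unicode/unorm2.h>
    #include <unicode/ustring.h>

    int32_t normalize_if_needed(const UChar *src, int32_t srcLen,
                                UChar *dest, int32_t destCap,
                                UErrorCode *status) {
        const UNormalizer2 *nfc = unorm2_getNFCInstance(status);
        if (U_FAILURE(*status)) return 0;

        /* UNORM_YES means the text is certainly already in NFC; on NO or
           MAYBE the full procedure has to run. */
        if (unorm2_quickCheck(nfc, src, srcLen, status) == UNORM_YES) {
            u_memcpy(dest, src, srcLen);   /* already normalised: pass through */
            return srcLen;
        }
        return unorm2_normalize(nfc, src, srcLen, dest, destCap, status);
    }

    The caller above it never needs to know whether the quick check fired;
    it simply asks for a normalised string and gets one.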

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

