Re: Unicode Normalisation Optimisation Experiments

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Sep 26 2003 - 07:37:47 EDT

    On 26/09/2003 10:52, jon@spin.ie wrote:

    > ...
    >
    >If there is a problem with this then it goes deeper than NFC, to the rules of how combining characters can or cannot be reordered, and to the meaning that the resulting strings have. If there is a problem with that, then it lies with those rules rather than with NFC, which merely uses them.
    >
    >
    >
    Actually, in my opinion based on experience with problem combinations in
    Hebrew and Arabic, the problem is not so much with the reordering rules
    as with the way that some canonical decompositions and combining classes
    have been inappropriately defined, and with the stability policy which
    decrees that even the most obvious mistakes cannot be corrected.

    The problem is that the definitions in the Unicode Standard conflict
    with the stability policy. For example, from p.83 of TUS 4.0:

    > D46 Combining class: A numeric value given to each combining Unicode
    > character that determines with which other combining characters it
    > typographically interacts.
    > • See Section 4.3, Combining Classes—Normative, for information about
    > the combining classes for Unicode characters.
    > Characters have the same class if they interact typographically, and
    > different classes if they do not.

    This is simply untrue and so needs to be changed. I have well-documented
    examples from Hebrew and from Arabic of combining characters which do
    not have the same combining class but do interact typographically.
    (Unless D46 is read as a counter-intuitive definition of
    "typographically interacts".) The obvious way of correcting this error,
    to adjust the combining classes, is ruled out by the stability policy.
    So the text of the standard, which can be changed in a new version,
    needs to be changed to read something like:

    Characters have the same class if according to the best information
    available in 2001 (?) they were thought to interact typographically, and
    different classes if they were thought not to.

    Or else simply state that combining classes are assigned arbitrarily -
    as also needs to happen with Unicode character names which similarly
    contain uncorrectable errors.
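
    To make the mismatch concrete, here is a minimal Python sketch using the
    standard unicodedata module. The two sample sequences are assumptions
    chosen for illustration (the message above does not say which
    combinations are meant): a Hebrew lamed carrying patah then hiriq, and
    an Arabic beh carrying shadda then kasra. In each pair the marks have
    different combining classes, so normalisation is free to reorder them,
    whether or not they interact typographically. The helper names describe
    and is_reordered are hypothetical.

```python
import unicodedata

# Illustrative sequences picked for this sketch; they are assumptions,
# not examples taken from the message itself.
SAMPLES = {
    "Hebrew lamed + patah + hiriq": "\u05DC\u05B7\u05B4",
    "Arabic beh + shadda + kasra": "\u0628\u0651\u0650",
}


def describe(text):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


def is_reordered(text):
    """True if NFC normalisation changes the sequence as it was typed."""
    return unicodedata.normalize("NFC", text) != text


for label, text in SAMPLES.items():
    nfc = unicodedata.normalize("NFC", text)
    print(label)
    print("  typed:    ", describe(text))
    print("  NFC:      ", describe(nfc))
    print("  reordered:", is_reordered(text))
```

    In both samples the marks come out of NFC sorted by ascending combining
    class, so the order in which they were entered is lost. That is harmless
    only if marks with different classes really never interact, which is the
    claim of D46 that is disputed above.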

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

