Re: Normalisation stability, was: Compression through normalization

From: John Cowan
Date: Tue Nov 25 2003 - 13:03:31 EST


    Philippe Verdy scripsit:

    > I just wonder however why it was "crucial" (as Unicode says in its
    > Definitions chapter) to expect a relative order of distinct non-zero
    > combining classes. For me these combining classes are arbitrary not only in
    > their absolute values as they are now, but even in their relative order.

    You are misconstruing the text. What is meant is that it is crucial
    to establish *some* fixed order of combining classes, not that the
    extant order is anything but arbitrary. It would not compromise stability to
    remap the existing combining class values onto the consecutive
    integers from 0 to 53, but it would compromise stability to alter
    the order (or to make class 0 other than 0).
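
    A minimal sketch of that point, assuming Python's standard unicodedata
    module as the source of combining classes. The names canonical_reorder,
    dense_ccc and reversed_ccc are hypothetical helpers made up for this
    illustration, and the reordering loop is a simplified version of
    canonical ordering (no decomposition step), not a full normalizer.

```python
import unicodedata

def canonical_reorder(text, ccc=unicodedata.combining):
    """Stable-sort every run of non-starters (class != 0) by combining class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

# Every combining class value used by this Python's Unicode data, ascending.
ALL_CLASSES = sorted({unicodedata.combining(chr(cp)) for cp in range(0x110000)})

# Order-preserving remap onto consecutive integers; class 0 stays 0.
DENSE_RANK = {value: rank for rank, value in enumerate(ALL_CLASSES)}

def dense_ccc(ch):
    return DENSE_RANK[unicodedata.combining(ch)]

def reversed_ccc(ch):
    """Hypothetical remap that keeps class 0 but reverses the non-zero order."""
    value = unicodedata.combining(ch)
    return 256 - value if value else 0

# 'a' + COMBINING ACUTE (class 230) + COMBINING DOT BELOW (class 220)
sample = "a\u0301\u0323"
same = canonical_reorder(sample, dense_ccc) == canonical_reorder(sample)
changed = canonical_reorder(sample, reversed_ccc) != canonical_reorder(sample)
print(same, changed)
```

    The dense remap yields the same reordered string because only the
    relative order of the class values enters the sort; the reversed remap
    yields a different string, which is exactly the kind of change that
    would break already-normalized text.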

    > The policy is also excessive when it defines, forever, that there will be
    > no more than 256 canonical classes. When some complex scripts need more
    > classes than we have now, we may run out of combining classes, so we
    > will need to compromise on the (linguistically) important concept of
    > canonical equivalence.

    We will never come close to exceeding this limit. Essentially all new
    combining characters are either class 0 or fall into one of the 200-range
    positional classes.

    > Comparing strings on their binary encoding within any scheme and normalized
    > form is quite stupid linguistically and semantically. The relevant way to
    > compare strings is UCA, if one wants to create a full order, or canonical
    > equivalence only.

    UCA is far too heavyweight for simple applications like comparing
    identifiers in programming languages or XML for identity. And as for
    canonical equivalence, the most efficient way to compare strings for
    it is to normalize both of them in some way and then do a raw
    binary compare. Since it adds efficiency to normalize only once,
    it is worthwhile to define a few normalization forms and urge
    people to produce text in one of them, so that receivers need not
    normalize but need only check for normalization, which is typically
    much cheaper.
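
    The normalize-then-compare approach can be sketched roughly as follows,
    assuming Python 3.8 or later (for unicodedata.is_normalized) and NFC as
    the agreed form. The function names are hypothetical and chosen only
    for this illustration.

```python
import unicodedata

def ensure_normalized(text, form="NFC"):
    """Receiver side: check first, and normalize only if the check fails."""
    if unicodedata.is_normalized(form, text):
        return text  # producer already did the work
    return unicodedata.normalize(form, text)

def canonically_equivalent(a, b, form="NFC"):
    """Bring both strings to one form, then do a raw binary compare."""
    left = ensure_normalized(a, form).encode("utf-8")
    right = ensure_normalized(b, form).encode("utf-8")
    return left == right

precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # raw compare
print(canonically_equivalent(precomposed, decomposed))  # after normalizing
```

    If producers already emit NFC, ensure_normalized returns its input
    unchanged and the comparison reduces to the byte compare alone.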

    Ambassador Trentino: I've said enough. I'm a man of few words.
    Rufus T. Firefly: I'm a man of one word: scram!
            --_Duck Soup_                   John Cowan
