RE: Normalisation stability, was: Compression through normalization

From: Philippe Verdy (
Date: Tue Nov 25 2003 - 11:48:39 EST

    From: Peter Kirk []
    > Sent: Tuesday, 25 November 2003 17:06
    > To:
    > Cc: Unicode@Unicode.Org
    > Subject: Re: Normalisation stability, was: Compression through
    > normalization
    > On 25/11/2003 07:22, Philippe Verdy wrote:
    > > ...
    > >
    > >Composition exclusions have a lower impact as well as the relative
    > >orders of canonical classes, as they don't affect canonical equivalence
    > >of strings, and thus won't affect applications based on the Unicode C10
    > >definition; they are important only to allow binary compares of
    > >normalized strings.
    > >
    > >
    > Thanks for the clarification. My point is that binary compares of
    > normalised strings are possible only if the strings have not been
    > transformed according to C10 since normalisation; and that the need to
    > support such binary compares has been used as a justification for a
    > refusal to correct errors in the Unicode combining classes.

    Corrections of combining classes would have been of the first type: it
    would no longer have been possible to always match strings that were
    previously canonically equivalent. So if process A normalizes a text T
    using the old decompositions and combining classes, and process B
    normalizes using the new definitions, the text is damaged, as B will see
    distinct texts from A=oldNFx(T) and T:

    it's then impossible to guarantee that on process B:
    and this would break all security models that expect canonical equivalence
    of input strings.
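
    To make the guarantee at stake concrete, here is a minimal sketch using
    Python's standard unicodedata module, which applies the UCD version
    shipped with the interpreter. The helper name canonically_equivalent is
    only illustrative. Two strings that differ in code points still match
    once both sides are normalized with the same definitions, and that is
    the property which would be lost if A and B used different combining
    classes.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings under one and the same set of NFD definitions."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# q + COMBINING DOT BELOW (class 220) + COMBINING DOT ABOVE (class 230),
# written in both possible orders.
s1 = "q\u0323\u0307"
s2 = "q\u0307\u0323"

print(s1 == s2)                        # False: the code points differ
print(canonically_equivalent(s1, s2))  # True: same text after reordering
```

    If one side had been reordered with a different class assignment for
    either mark, the final comparison could no longer be relied upon.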

    Without this policy, systems would simply forget to ever normalize their
    texts, and NFx forms would have been deprecated or discouraged.

    I just wonder, however, why it was "crucial" (as Unicode says in its
    Definitions chapter) to require a relative order of distinct non-zero
    combining classes. To me these combining classes are arbitrary not only
    in their absolute values, as they are now, but even in their relative
    order.

    The policy has gone a bit too far. What should have been made permanent
    is only the partition of all Unicode characters into separate classes,
    with no moving of characters between classes, no splitting and no
    merging of classes. This means that if two characters are in the same
    combining class, they must remain in the same class forever. I do think
    that the values of combining classes should not have been numeric, but
    symbolic, with no normative order among these symbols. This would still
    have preserved the idea of canonically equivalent strings.
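
    The argument above can be sketched as follows. This is an illustration
    only, not the reordering algorithm of the standard: the function reorder
    and the rank table are hypothetical, and the input is assumed to be
    already fully decomposed. Each non-zero class gets an arbitrary rank,
    and each run of combining marks is stable-sorted by that rank. Whatever
    permutation is chosen, marks of different classes end up in one agreed
    order, while marks of the same class never swap.

```python
import random
import unicodedata

def reorder(s: str, rank: dict) -> str:
    """Stable-sort each run of combining marks by rank[class].

    Characters of class 0 (starters) stay in place and end the current run.
    """
    def key(ch):
        return rank[unicodedata.combining(ch)]

    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=key))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=key))
    return "".join(out)

def random_rank(seed: int) -> dict:
    """Assign the non-zero classes 1..254 an arbitrary order."""
    classes = list(range(1, 255))
    shuffled = classes[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(classes, shuffled))

below_above = "q\u0323\u0307"   # classes 220 then 230
above_below = "q\u0307\u0323"   # classes 230 then 220
acute_grave = "a\u0301\u0300"   # both marks in class 230
grave_acute = "a\u0300\u0301"   # same marks, other order

for seed in range(5):
    rank = random_rank(seed)
    # Equivalent strings agree under every ordering of the classes...
    assert reorder(below_above, rank) == reorder(above_below, rank)
    # ...and marks of the same class are never swapped.
    assert reorder(acute_grave, rank) != reorder(grave_acute, rank)
```

    The outputs generally differ from standard NFD, so such strings could
    not be compared byte for byte with those of another implementation, but
    the equivalence relation between strings is the same.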

    The policy is also excessive when it defines, forever, that there will
    be no more than 256 canonical combining classes. When some complex
    scripts need more classes than we have now, we may run out of combining
    classes, and we will then need to compromise on the linguistically
    important concept of canonical equivalence.

    Comparing strings on their binary encoding, in whatever encoding scheme
    and normalization form, is quite stupid linguistically and semantically.
    The relevant way to compare strings is UCA, if one wants a total order,
    or canonical equivalence alone.

    From there, applications would have been free to use the normalization
    forms they want, including for rendering or for preparing strings for
    UCA, as long as they preserve the partition of characters into their
    existing classes, and do not add, modify or remove decompositions of the
    UCD (composition exclusions can be safely ignored, as applications can
    still compare the results by looking at their decomposed forms).
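
    As a small illustration of the remark on composition exclusions, again
    with Python's unicodedata module: U+0958 DEVANAGARI LETTER QA is listed
    in CompositionExclusions.txt, so standard NFC leaves it decomposed. An
    application that ignored the exclusion and kept the single code point
    would produce a different binary string, but the two still match when
    compared through their decompositions.

```python
import unicodedata

qa = "\u0958"          # DEVANAGARI LETTER QA (composition-excluded)
pair = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

# Standard NFC does not recompose the pair because of the exclusion.
print(unicodedata.normalize("NFC", qa) == pair)    # True
print(unicodedata.normalize("NFC", pair) == pair)  # True

# A binary compare of the two spellings fails...
print(qa == pair)                                  # False
# ...but comparing the decomposed forms succeeds.
same = unicodedata.normalize("NFD", qa) == unicodedata.normalize("NFD", pair)
print(same)                                        # True
```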


    This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 12:42:25 EST