Re: Normalisation stability, was: Compression through normalization

From: John Cowan ([email protected])
Date: Tue Nov 25 2003 - 13:03:31 EST

Next message: Peter Constable: "RE: How can I have OTF for MacOS"

Previous message: John Cowan: "Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)"
In reply to: Philippe Verdy: "RE: Normalisation stability, was: Compression through normalization"
Next in thread: Philippe Verdy: "RE: Normalisation stability, was: Compression through normalization"
Reply: Philippe Verdy: "RE: Normalisation stability, was: Compression through normalization"
Reply: Peter Kirk: "Re: Normalisation stability, was: Compression through normalization"

Philippe Verdy scripsit:

> I just wonder however why it was "crucial" (as Unicode says in its
> Definitions chapter) to expect a relative order of distinct non-zero
> combining classes. For me these combining classes are arbitrary not only on
> their absolute value as they are now, but even their relative order.

You are misconstruing the text. What is meant is that it is crucial
to establish *some* fixed order of combining classes, not that the
existing order is anything other than arbitrary. It would not
compromise stability to remap the existing combining class values onto
the consecutive integers from 0 to 53, but it would compromise
stability to alter their relative order (or to make class 0 anything
other than 0).
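
To make the point concrete, here is a minimal sketch in Python of
canonical reordering: a stable sort of each run of non-starters by
combining class. The names canonical_reorder and dense_ccc are
hypothetical helpers written for this illustration, not part of any
library; only unicodedata.combining is a real standard-library call.
The sketch performs reordering only (no decomposition), and the number
of distinct classes it finds depends on the Unicode version your
Python ships with.

```python
import unicodedata

def canonical_reorder(text, ccc=unicodedata.combining):
    """Stable-sort every run of non-starters (class != 0) by combining class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            run.sort(key=ccc)      # list.sort is stable, as reordering requires
            out.extend(run)
            run = []
            out.append(ch)         # starters never move
        else:
            run.append(ch)
    run.sort(key=ccc)              # flush a trailing run of combining marks
    out.extend(run)
    return "".join(out)

def dense_ccc():
    """Hypothetical remap: existing classes -> consecutive integers, same order."""
    classes = sorted({unicodedata.combining(chr(cp)) for cp in range(0x110000)})
    rank = {c: i for i, c in enumerate(classes)}   # 0 is smallest, so 0 stays 0
    return lambda ch: rank[unicodedata.combining(ch)]

s = "a\u0301\u0323"                # acute (class 230) before dot below (class 220)
remapped = dense_ccc()
# Only the relative order of the classes matters to the result:
print(canonical_reorder(s) == canonical_reorder(s, remapped))
```

An order-preserving remap leaves every reordered string unchanged,
whereas a mapping that swaps the relative order of two classes would
produce a different result for strings like the one above.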

> The policy is also excessive when it defines, for ever, that there will be
> no more than 256 canonical classes. When some complex scripts will need more
> classes than what we have now, we may run out of combining classes, so we
> will need to make compromission on the important concept (linguistically) of
> canonical equivalence.

We will never come close to exceeding this limit. Essentially all new
combining characters are either class 0 or fall into one of the 200-range
positional classes.

> Comparing strings on their binary encoding within any scheme and normalized
> form is quite stupid linguistically and semantically. The relevant way to
> compare strings is UCA, if one wants to create a full order, or canonical
> equivalence only.

UCA is far too heavyweight for simple applications like comparing
identifiers in programming languages or XML for identity. And as for
canonical equivalence, the most efficient way to compare strings for
it is to normalize both of them to the same form and then do a raw
binary compare. Since it is more efficient to normalize only once,
it is worthwhile to define a few normalization forms and urge
people to produce text in one of them, so that receivers need not
normalize but need only check for normalization, which is typically
much cheaper.
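
A rough sketch of that approach in Python follows. The function names
canonically_equivalent and same_identifier are hypothetical, chosen
for this example; unicodedata.normalize and unicodedata.is_normalized
are real standard-library calls (is_normalized assumes Python 3.8 or
later). The choice of NFC as the agreed form is also an assumption.

```python
import unicodedata

def canonically_equivalent(a, b, form="NFC"):
    """Normalize both strings to the same form, then compare them directly."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def ensure_normalized(text, form="NFC"):
    """Receiver side: check first, and normalize only if the check fails."""
    if unicodedata.is_normalized(form, text):
        return text                 # already normalized, nothing to do
    return unicodedata.normalize(form, text)

def same_identifier(a, b, form="NFC"):
    """Compare two identifiers for identity by their encoded bytes."""
    a_bytes = ensure_normalized(a, form).encode("utf-8")
    b_bytes = ensure_normalized(b, form).encode("utf-8")
    return a_bytes == b_bytes       # raw binary compare

precomposed = "caf\u00e9"           # e with acute as one code point
decomposed = "cafe\u0301"           # e followed by combining acute
print(canonically_equivalent(precomposed, decomposed))
print(same_identifier(precomposed, decomposed))
```

When senders already produce text in the agreed form, the receiver
pays only for the check in ensure_normalized and the byte comparison.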

-- 
Ambassador Trentino: I've said enough. I'm a man of few words.
Rufus T. Firefly: I'm a man of one word: scram!
        --_Duck Soup_                   John Cowan <[email protected]>

This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 13:42:13 EST