RE: Normalisation stability, was: Compression through normalization

From: Philippe Verdy ([email protected])
Date: Tue Nov 25 2003 - 11:48:39 EST

From: Peter Kirk [mailto:[email protected]]
> Sent: Tuesday, 25 November 2003 17:06
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Normalisation stability, was: Compression through
> normalization
>
>
> On 25/11/2003 07:22, Philippe Verdy wrote:
>
> > ...
> >
> >Composition exclusions have a lower impact as well as the relative
> >orders of canonical classes, as they don't affect canonical equivalence
> >of strings, and thus won't affect applications based on the Unicode C10
> >definition; they are important only to allow binary compares of
> >normalized strings.
> >
> >
> Thanks for the clarification. My point is that binary compares of
> normalised strings are possible only if the strings have not been
> transformed according to C10 since normalisation; and that the need to
> support such binary compares has been used as a justification for a
> refusal to correct errors in the Unicode combining classes.

Corrections of combining classes would have been of the first type: it would
no longer be possible to always match strings that were previously
canonically equivalent. So if process A normalizes a text T using the old
decompositions and combining classes, and process B normalizes using the
new definitions, the string is damaged, as B may see the text it receives
from A, oldNFx(T), and the original T as two distinct texts:

It is then impossible to guarantee, on process B, that
newNFx(oldNFx(T)) = newNFx(T),
and this would break all security models that rely on canonical equivalence
of input strings.
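
To make the failure concrete, here is a minimal Python sketch. It is only a toy model: the marks "x" and "y", the base "B", the class tables OLD and NEW and the helper reorder() are all hypothetical, and the "correction" shown (merging two classes into one) is just one assumed kind of change, not any specific proposal. It only models the canonical reordering step, not decomposition.

```python
def reorder(text, ccc):
    """Toy canonical reordering: stable-sort each run of marks by class."""
    out, run = [], []
    for ch in text:
        if ccc.get(ch, 0) == 0:
            # Starter (class 0): flush the pending run of combining marks.
            out.extend(sorted(run, key=lambda c: ccc[c]))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: ccc[c]))
    return "".join(out)

# Hypothetical class tables; characters not listed have class 0.
OLD = {"x": 220, "y": 230}
NEW = {"x": 230, "y": 230}  # assumed "correction": both marks now share a class

T = "Byx"
old_then_new = reorder(reorder(T, OLD), NEW)  # what B computes from A's output
new_only = reorder(T, NEW)                    # what B computes from T itself

print(repr(old_then_new), repr(new_only), old_then_new == new_only)
```

Under the old table the two marks may be swapped, so A emits "Bxy"; under the new table they may not, so B keeps both "Bxy" and "Byx" as they are and treats them as different texts.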

Without this stability policy, systems would simply have given up
normalizing their texts, and the NFx forms would have been deprecated or
discouraged.

I still wonder, however, why it was "crucial" (as Unicode says in its
Definitions chapter) to fix a relative order among distinct non-zero
combining classes. To me, these combining classes are arbitrary not only in
their absolute values, as they are now, but even in their relative order.

The policy has gone a bit too far. What should have been made permanent is
only the partition of all Unicode characters into separate classes, with no
moving of characters between classes, no splitting of a class and no merging
of classes. This means that if two characters are in the same combining
class, they must remain in the same class for ever, and if they are in
different classes, they must remain in different classes. I do think that
the values of combining classes should not have been numeric, but symbolic,
with no normative order among these symbols. This would still have preserved
the idea of canonically equivalent strings.
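
As a rough illustration of that last point, here is a Python sketch of an equivalence test that uses only symbolic class labels and never compares them by order. The labels "below" and "above", the marks "x", "y", "z", the table CLS and the helper signature() are hypothetical names, and the sketch assumes the input is already fully decomposed. Two marks may be exchanged only when their labels differ, so a run of marks is characterized by the sequence of marks it contains for each label.

```python
def signature(text, cls):
    """Order-free signature: for each run of marks, map label -> marks in order."""
    sig, run = [], {}
    for ch in text:
        label = cls.get(ch)  # None means a starter (class 0)
        if label is None:
            if run:
                sig.append(run)
                run = {}
            sig.append(ch)
        else:
            # Marks with the same label keep their relative order.
            run.setdefault(label, []).append(ch)
    if run:
        sig.append(run)
    return sig

# Hypothetical symbolic classes, with no order defined between the labels.
CLS = {"x": "below", "y": "above", "z": "above"}

# Marks with different labels can be swapped: these two are equivalent.
print(signature("Byxz", CLS) == signature("Bxyz", CLS))
# Marks with the same label cannot be swapped: these two are distinct.
print(signature("Byz", CLS) == signature("Bzy", CLS))
```

The comparison relies on dictionary equality, which ignores key order, so no ranking of the labels is ever needed. What it cannot give is a single normalized string per equivalence class that could be compared byte for byte.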

The policy is also excessive when it defines, for ever, that there will be
no more than 256 canonical combining classes. When some complex scripts need
more classes than we have now, we may run out of combining classes, and we
will then have to make compromises on the (linguistically) important concept
of canonical equivalence.

Comparing strings by their binary encoding, in whatever encoding scheme and
normalization form, is quite meaningless linguistically and semantically.
The relevant way to compare strings is the UCA (Unicode Collation
Algorithm), if one wants a total order, or canonical equivalence alone
otherwise.
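
For instance, a short Python sketch using the standard unicodedata module shows strings that differ in binary form while being canonically equivalent. The helper name canonically_equivalent() is my own, and comparing NFD forms is just one simple way to test equivalence; UCA itself is not part of the Python standard library and is not shown.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Test canonical equivalence by comparing fully decomposed (NFD) forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# The raw code point sequences differ, yet the strings are equivalent.
print(precomposed == decomposed)
print(canonically_equivalent(precomposed, decomposed))

# Marks from different combining classes (dot below, acute) in either order.
print(canonically_equivalent("a\u0323\u0301", "a\u0301\u0323"))
```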

On that basis, applications would have been free to use whatever
normalization forms they want, including for rendering or for preparing
strings for UCA, as long as they preserve the partition of characters into
their existing classes, and do not add, modify or remove decompositions in
the UCD (composition exclusions can safely be ignored, as applications can
still compare the results by looking at their decomposed forms).
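
The remark about composition exclusions can be checked with a small Python sketch based on the standard unicodedata module. U+0958 DEVANAGARI LETTER QA is a composition-excluded character; the variable names are mine, and the "application that ignores exclusions" is only an assumed scenario in which the precomposed character is kept or produced.

```python
import unicodedata

excluded = "\u0958"      # DEVANAGARI LETTER QA, a composition exclusion
pair = "\u0915\u093c"    # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

# Standard NFC never produces the excluded character: it stays decomposed.
nfc_of_excluded = unicodedata.normalize("NFC", excluded)
print(nfc_of_excluded == pair)

# A string that still contains U+0958 (say, from an application that
# ignores the exclusions) compares equal once both sides are decomposed.
same_after_nfd = (unicodedata.normalize("NFD", excluded)
                  == unicodedata.normalize("NFD", pair))
print(same_after_nfd)
```

So whether or not an application honours the exclusions when composing, comparing the decomposed forms gives the same answer.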


This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 12:42:25 EST