Re: Normalization and the principled U-turn

From: Mark E. Davis
Date: Tue Sep 28 1999 - 12:05:03 EDT

Wait a second; let's not leap to conclusions. There is no such versioning
envisioned, and no requirement for upgrading the application as long as its
supported character repertoire does not change.

What you are saying is that an application that accepts unnormalized data that
contains characters from an upstream version of Unicode will not be able to
compare that data correctly under canonical equivalence. That is correct, and
has always been true; an application can't compare characters for canonical
equivalence that it doesn't know about, even using normalization. There are no
magic wands.
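
A minimal Python sketch of that limitation, not anything taken from UTR#15. It assumes a hypothetical application whose decomposition data does not cover U+01FD (ae with acute), a real character used here only as a stand-in for one added in an upstream version. The names UNKNOWN_TO_APP, app_decompose and app_equivalent are invented for illustration, and the sketch skips canonical reordering and Hangul.

```python
import unicodedata

# Hypothetical: characters outside the application's supported repertoire.
# U+01FD merely stands in for a character from an upstream Unicode version.
UNKNOWN_TO_APP = {"\u01FD"}

def app_decompose(text):
    """Canonical decomposition restricted to what the application knows."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        # Leave the character alone if it is unknown to the application,
        # has no decomposition, or has only a compatibility ("<...>") one.
        if ch in UNKNOWN_TO_APP or not decomp or decomp.startswith("<"):
            out.append(ch)
        else:
            mapped = "".join(chr(int(cp, 16)) for cp in decomp.split())
            out.append(app_decompose(mapped))  # decompose recursively
    return "".join(out)

def app_equivalent(a, b):
    """Canonical-equivalence test as the limited application would perform it."""
    return app_decompose(a) == app_decompose(b)

# Known character: e-acute matches e + combining acute.
print(app_equivalent("\u00E9", "e\u0301"))
# Unknown character: the precomposed form is not matched to its sequence.
print(app_equivalent("\u01FD", "\u00E6\u0301"))
```

The second comparison fails only because the application's data stops short of the character; the sequence itself is handled fine.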

However, if the data that it accepts was normalized for the upstream version,
then it does work.
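
One way to picture this, as a sketch under stated assumptions: a producer that knows the newer repertoire normalizes before handing the data on, and the downstream application only compares code points. The message does not say which normalization form is meant; the decomposed form (NFD) is assumed here, and upstream_normalize and app_compare are hypothetical names.

```python
import unicodedata

def upstream_normalize(text):
    # Stands in for a producer that knows the newer repertoire.
    # Assumption: it emits the fully decomposed form (NFD).
    return unicodedata.normalize("NFD", text)

def app_compare(a, b):
    # The downstream application has no data for the new character;
    # it simply compares code point sequences.
    return a == b

raw_a = "\u01FD"          # precomposed ae-acute (stand-in for an upstream character)
raw_b = "\u00E6\u0301"    # ae followed by combining acute

print(app_compare(raw_a, raw_b))                                          # unnormalized input
print(app_compare(upstream_normalize(raw_a), upstream_normalize(raw_b)))  # normalized upstream
```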

But remember, most processes on text will not work anyway if the character is
from an upstream version. Just from your example, sorting won't work if the
character is outside of the repertoire, since the application won't know what
the right collation element is! Searching won't either, if any loose match is
used (e.g. caseless), or the language has different linguistic equivalencies
(see UTR#10).
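
The caseless case can be sketched the same way. The following assumes a hypothetical application with no case data for U+01FC and U+01FD (capital and small ae with acute, again only stand-ins for upstream characters); UNKNOWN_TO_APP, app_casefold and caseless_find are illustrative names, not any real API.

```python
# Hypothetical: characters the application has no case mappings for.
UNKNOWN_TO_APP = {"\u01FC", "\u01FD"}

def app_casefold(text):
    """Case-fold only the characters the application knows about."""
    return "".join(ch if ch in UNKNOWN_TO_APP else ch.casefold() for ch in text)

def caseless_find(haystack, needle):
    """Return the index of a caseless match, or -1 if there is none."""
    return app_casefold(haystack).find(app_casefold(needle))

# Known characters: capital AE matches small ae.
print(caseless_find("An \u00C6on", "\u00E6on"))
# Unknown characters: the capital and small forms are never matched.
print(caseless_find("An \u01FCon", "\u01FDon"))
```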

This also ignores the fact that in practice both normalization and sorting are
functions of the operating system, just like display. They are not hard-linked
into the application. Once those APIs appear, then any applications that use
them are insulated by the OS from version changes. As the OS is upgraded to new
versions of Unicode, its supported character repertoire is expanded, and
effectively so is the application's.
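
As a rough analogy only, with Python's runtime library standing in for the operating system: an application that delegates to a platform normalization service carries no character tables of its own, so its repertoire follows whatever Unicode version the platform ships. The function names below are assumptions made for illustration.

```python
import unicodedata

def canonically_equivalent(a, b):
    # All character data lives in the platform library, not in the application.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def platform_unicode_version():
    # The Unicode version of the data the platform currently provides.
    return unicodedata.unidata_version

print(platform_unicode_version())
print(canonically_equivalent("\u01FD", "\u00E6\u0301"))
```

Upgrading the platform changes the answers for newly added characters without any change to the application code.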


Michael Everson wrote:

> I have noted that some provision for versioning of the normalization
> algorithms is presaged in UTR#15. It makes me shudder, thinking what will
> happen when my (fictitious) typesetting program, 2000 Unicode XPress
> Version 3, would do when presented with post-Unicode-3.0 data which has the
> æ-grave precomposed as well as æ with combining grave. Easy to tell:
> sorting and searching won't work for the precomposed letter but it will
> work for the decomposed sequence; that is, my Unicode XPress will be broken
> until such time as its owners spend time and money on revving up to the new
> normalization algorithm, and until such time as I spend my money on an
> upgrade I didn't need except for some vowels I never use.
> In other words, it's not enough that we have to upgrade fonts for new
> versions of the standard (which is reasonable as scripts get added) but all
> our apps as well.
> Lithuanians (little market) have asked for a bunch of new characters as you
> know. But Japan (big market) also has new "compatibility" characters,
> including æ-grave. So the Will of Industry is soon to be tested.
> --
> Michael Everson * Everson Gunn Teoranta *
> 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
> Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement)
> 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT