From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 28 2003 - 21:34:50 EDT
> After reading through some of the archives (some pointers to the relevant
> parts would be helpful, please--something beyond "consult the archives"), it
> strikes me that normalization, with its severe requirements, is going to
> eventually so distort Unicode that it will render it nearly unusable.
> Consider the thread that starts at
> http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML020/0651.html
> (from 1999, for goodness sake!):
Yep, good pointer. And, in particular, my reply on December 21, 1999:
http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML020/0655.html
(unicode-ml message 0655, for people who have trouble clicking through)
which laid out the encoding consequences going forward. Mark Davis
and I were campaigning hard in 1999 to ensure that everyone went
into the home stretch for Unicode 3.0 with "eyes wide open", so that
it was clear what normalization meant after Unicode 3.0 was
published.
> Normalization will ossify Unicode: it will
> become harder and harder to accept new, clean encodings. This is truly going
> to become the tail that wags the dog.
I concur that normalization will ossify Unicode -- in part. It clearly
has established some sharp constraints that limit the freedom of
the committees to introduce characters that would have certain
kinds of equivalence relations to existing characters, the freedom
of the UTC to modify established decompositions, and the freedom
of the UTC to modify combining classes -- which is where this entire
discussion has taken off.
However, I am rather more sanguine regarding other aspects of
the standard going forward. Nothing about normalization has
prevented progress on the reasonable encoding of additional
scripts, nor the addition of many thousands of characters for
existing scripts (e.g. Han) or for symbols. In that respect it
is not ossified at all.
And the *goal* for encoding a new script is to encode it in such
a way that normalization is *not* an issue for it. An
ideal encoding for a script usually has a fairly self-evident
representation for any particular piece of text, and that
self-evident representation is the *only* such representation.
The worst normalization problems -- by far -- for Unicode are
found in the scripts (such as Latin, Greek, and Hangul) which
have long histories of legacy implementations prior to Unicode --
histories which got reflected into the standard via multiple
sets of compatibility characters, for example.
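The Latin legacy case is easy to see in practice. A minimal Python sketch, using the standard unicodedata module, showing that "é" has two canonically equivalent code point sequences and that each normalization form picks exactly one of them:

```python
# Latin "é" inherited two representations from legacy encodings;
# NFC and NFD each select a single canonical one.
import unicodedata

precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT

assert precomposed != decomposed   # distinct code point sequences
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

A freshly encoded script with no precomposed legacy characters never has this ambiguity, which is exactly the goal described above.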
> My prediction: normalization will eventually force some sort of version
> indicator to be included in all (normalized) Unicode text. (Weak analogy:
> much as DTD references are, either explicitly or implicitly, part of all XML
> documents).
I doubt it. I think it is much more likely that the stability of
normalization per se will hold. And when people finally come to understand
that Unicode normalization forms don't meet all of their
string equivalencing needs, the pressure will grow to define other
kinds of equivalences. And that will be addressed through other
mechanisms, the seed for which is already being discussed in:
http://www.unicode.org/reports/tr30/
But that document obviously needs a lot more work in committee
before it is complete.
> Normalization and its applications (such as early normalization for string
> identity matching) may indeed be the show-stopper (today), so this question
> may be moot, but I'll ask it anyway: Are there any other uses of combining
> classes that would break (in ways apart from normalization breaking) if the
> assignments for the Hebrew vowels were changed? We might as well be sure
> that we know the entire scope of the issues involved.
Not that I know of. The reason for canonical combining classes
in the standard is their use in canonical reordering. And the
reason for canonical reordering is its use in normalization.
--Ken
>
> Ted
This archive was generated by hypermail 2.1.5 : Mon Jul 28 2003 - 22:07:33 EDT