Re: Possible problem going forward with normalization

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 21 1999 - 22:11:43 EST


John Cowan asked:

>
> It occurs to me that when a future version of Unicode is released with
> new combining marks, text that mixes old and new marks on the same
> base character, or that generates incorrectly ordered new marks,
> will produce inconsistent results when passed through normalization.
>
> Consider the sequence LATIN SMALL LETTER A, COMBINING GRACKLE (a
> post-3.0 character of class 232), COMBINING ACUTE ACCENT.
> A 3.0 implementation of normalization will not recognize the
> COMBINING GRACKLE as a mark, and will not swap it with the
> acute mark. A post-3.0 implementation with updated tables
> will do so.
>
> What is the current thinking on this?
>

There are quite a number of problems that can occur for
normalization in future versions of Unicode.

Let's consider the following hypothetical additions for Unicode 4.0:

COMBINING GRACKLE (class 232)
COMBINING CYRILLIC DESCENDER (class 220)
LATIN SMALL LETTER S WITH COMMA ABOVE [for Barbareño Chumash]

Now consider the program Normix, that claims Unicode 3.0
compliance, and its update Ny-Normix, that claims Unicode 4.0
compliance.

Normix expects Unicode 3.0 data. If given data with the new
Unicode 4.0 characters in it, those should be treated as
any unassigned code points would be. The strategy taken for
that would vary, depending on how clever Normix wants to be:
it could simply stop and say this is not Unicode 3.0 data,
or it could treat all unassigned characters as pass-through
ignorables for normalization, or it could do something else.
What is important to note, however, is that Normix could have
no information about the combining classes or decompositions of
any of these new characters. Any newly added information is
outside its scope.
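
One way the pass-through strategy could look, sketched in Python
(purely illustrative -- the table excerpt is made up, and a real
implementation would of course load the full 3.0 data):

    # Hypothetical excerpt of a combining-class table built from
    # UnicodeData-3.0.0.txt.
    COMBINING_CLASS_3_0 = {
        0x0301: 230,   # COMBINING ACUTE ACCENT
        0x0313: 230,   # COMBINING COMMA ABOVE
        # ... the rest of the 3.0 data ...
    }

    def combining_class(cp, table=COMBINING_CLASS_3_0):
        # Code points unknown to the 3.0 tables (including anything
        # added in 4.0) look like starters: class 0, no decomposition,
        # no reordering.
        return table.get(cp, 0)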

Ny-Normix expects Unicode 4.0 data. Therefore, it should handle
any new 4.0 character appropriately for normalization, basing its
handling on the decompositions and combining classes assigned
in the corresponding UnicodeData-4.0.0.txt file that defines
them. (For the purposes of this discussion, it doesn't really
matter whether the update from Normix to Ny-Normix involves
hard-coded differences in a program, or whether Ny-Normix simply
reads the new data files and adjusts itself automatically.)
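
If Ny-Normix takes the data-driven route, the upgrade is little more
than pointing the table loader at the new file. A rough sketch in
Python, assuming a local copy of UnicodeData.txt, whose semicolon-
delimited lines carry the code point in field 0 and the canonical
combining class in field 3:

    def load_combining_classes(path):
        # Build a {code point: combining class} table from
        # UnicodeData.txt, keeping only the non-zero classes.
        ccc = {}
        with open(path) as f:
            for line in f:
                fields = line.split(";")
                klass = int(fields[3])
                if klass != 0:
                    ccc[int(fields[0], 16)] = klass
        return ccc

Swapping UnicodeData-3.0.0.txt for UnicodeData-4.0.0.txt is then,
in this picture, essentially the whole upgrade from Normix to
Ny-Normix.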

Now consider some possible Unicode 4.0 data:

LATIN SMALL LETTER A, COMBINING GRACKLE, COMBINING ACUTE ACCENT

Ny-Normix will normalize this (Form D) to:

LATIN SMALL LETTER A, COMBINING ACUTE ACCENT, COMBINING GRACKLE

Normix, of course, could not do this reordering; it would be
whistling in the dark, since it has no combining class to work
with. But that is not really a problem for Normix, since it
cannot really be expected to normalize data that is outside
its claim of conformance (characters that may not have even
existed when the program was written).
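
The difference comes down to the Canonical Ordering step. A small
sketch in Python (the code point chosen for the hypothetical GRACKLE
is made up, as are the two combining-class lookups):

    def canonical_reorder(codepoints, ccc):
        # Canonical Ordering: bubble each mark backwards past any
        # preceding mark with a higher combining class; class 0 acts
        # as a barrier.
        cps = list(codepoints)
        for i in range(1, len(cps)):
            j = i
            while (j > 0 and ccc(cps[j]) != 0
                   and ccc(cps[j-1]) > ccc(cps[j])):
                cps[j-1], cps[j] = cps[j], cps[j-1]
                j -= 1
        return cps

    A, ACUTE, GRACKLE = 0x0061, 0x0301, 0x10FFF   # GRACKLE: invented

    ccc_3_0 = lambda cp: {ACUTE: 230}.get(cp, 0)   # GRACKLE unknown -> 0
    ccc_4_0 = lambda cp: {ACUTE: 230, GRACKLE: 232}.get(cp, 0)

    canonical_reorder([A, GRACKLE, ACUTE], ccc_3_0)
    # unchanged: class 0 blocks the swap
    canonical_reorder([A, GRACKLE, ACUTE], ccc_4_0)
    # -> [A, ACUTE, GRACKLE]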

But here we are talking about *new* data anyway. The critical
issue is that Ny-Normix must not report that Unicode 3.0 data
normalized by Normix is *not* in fact normalized -- or, what amounts
to the same thing, that a Unicode 3.0 string normalized by Normix must
compare equal to the result of normalizing it again with Ny-Normix.
It is less of a problem if Normix is inappropriately used to normalize
a Unicode 4.0 string and the result does not match the normalization
of that string by Ny-Normix.

Second example:

CYRILLIC SMALL LETTER EN WITH DESCENDER

This is a Unicode 3.0 letter. Normix will normalize it as itself
(in any normalization form), since the letter is atomic and has no
decomposition.

The problem is what Ny-Normix, working with Unicode 4.0 data tables,
will do. If CYRILLIC SMALL LETTER EN WITH DESCENDER has a canonical
decomposition in the new tables, using the newly encoded COMBINING
CYRILLIC DESCENDER, then Ny-Normix *will* normalize this letter
differently than Normix did (for both Form D and Form C). That is
bad, since it means that upgrading from Normix to Ny-Normix would
invalidate already normalized data (that could be stored anywhere
by that time).
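
To see the hazard concretely, here is a toy, table-driven Form D step
in Python (U+04A3 and U+043D are the real code points; the COMBINING
CYRILLIC DESCENDER code point and the 4.0 decomposition are invented,
and are exactly what Principle 1 below argues against):

    EN_DESC, EN, DESCENDER = 0x04A3, 0x043D, 0x10FFE  # DESCENDER: invented

    DECOMP_3_0 = {}                           # 3.0: the letter is atomic
    DECOMP_4_0 = {EN_DESC: [EN, DESCENDER]}   # hypothetical 4.0 tables

    def decompose(codepoints, table):
        # Expand canonical decompositions (one level suffices here).
        out = []
        for cp in codepoints:
            out.extend(table.get(cp, [cp]))
        return out

    decompose([EN_DESC], DECOMP_3_0)   # Normix:    [0x04A3]
    decompose([EN_DESC], DECOMP_4_0)   # Ny-Normix: [0x043D, 0x10FFE]

The same stored, already-normalized text would then fail a binary
comparison against its own renormalization after the upgrade.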

That leads to:

Principle 1: Post-Unicode 3.0, do not encode any new combining marks
that are intended to represent decomposable pieces of already existing
encoded characters. And if such a combining mark does get encoded,
despite everyone's best intentions, *NEVER* *EVER* use it in a
canonical decomposition in UnicodeData.txt, even if it confuses people
not to do so.

Third example:

LATIN SMALL LETTER S, COMBINING COMMA ABOVE (U+0313)

This is a valid Unicode 3.0 combining character sequence that could
be used to represent the Chumash letter s with comma above. Normix
will normalize this sequence as itself (in any normalization form --
presuming no combining marks of class < 230 follow it), since there
is no composed character in Unicode 3.0 that this sequence is
canonically equivalent to.

The problem, once again, is what Ny-Normix, working with Unicode 4.0
data tables, will do. Now that LATIN SMALL LETTER S WITH COMMA ABOVE
has been added to the standard, with (as we presume) a decomposition
to LATIN SMALL LETTER S + COMBINING COMMA ABOVE, wouldn't Ny-Normix
normalize the existing Unicode 3.0 sequence to the newly encoded
precomposed form for normalization form C? Well, no, it wouldn't. This
is already accounted for in the specification of normalization. The
new precomposed letter would be added to the composition exclusions
table, precisely because it is necessary to do so to keep Ny-Normix
from normalizing existing Unicode 3.0 data differently than Normix
did. That keeps existing normalized data valid, even as the normalization
implementation is upgraded.
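
In other words, the Form C composition step consults the exclusion
set before it ever produces a composite. A minimal sketch in Python
(everything follows the scenario above; the code point standing in
for the new precomposed letter is invented):

    # 0x10FFD stands in for the (invented) precomposed letter.
    S, COMMA_ABOVE, S_COMMA_ABOVE = 0x0073, 0x0313, 0x10FFD

    COMPOSE_4_0 = {(S, COMMA_ABOVE): S_COMMA_ABOVE}
    EXCLUSIONS_4_0 = {S_COMMA_ABOVE}   # from CompositionExclusions.txt

    def compose_pair(base, mark,
                     table=COMPOSE_4_0, excluded=EXCLUSIONS_4_0):
        # Return the composite for <base, mark>, or None if the pair
        # does not compose -- including when the composite is excluded.
        composite = table.get((base, mark))
        if composite is None or composite in excluded:
            return None
        return composite

    compose_pair(S, COMMA_ABOVE)
    # -> None: the 3.0 sequence survives Form C unchanged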

That leads to:

Principle 2: Post-Unicode 3.0, do not encode any new precomposed
characters that are already representable by sequences of base
character plus one or more combining marks. To do so would be
superfluous: processing that depends on normalization will decompose
the character anyway into the combining character sequence that was
already valid, so encoding it as a precomposed character does nothing
but add another equivalence to the already overburdened tables,
without accomplishing what its proposer presumably intended.

If, contrary to everyone's best intentions, such precomposed characters
do get encoded, give them canonical decompositions in the new
version of UnicodeData.txt *and* add them to CompositionExclusions.txt,
so that updated normalizers do not disrupt already normalized text.

--Ken


