Re: A basic question on encoding Latin characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 24 1999 - 13:05:25 EDT


Dan Oscarsson asked:

>
> >The problem is that none of those reasons are good *enough*. And with
> >the introduction of Unicode 3.0 and the way it is tied to normalization
> >forms that soon will become ubiquitous on the Internet, the associated
> >costs for encoding new precomposed characters have risen steeply,
> >and the associated benefits have been lessened (since they are going
> >to end up decomposed in the normalization form seen on the Internet
> >anyway).
>
> WHY are we going to have unicode in decomposed normalised form on the
> Internet?
>
> I have seen many wanting normalisation as defined by form C in
> unicode report #15.
> I clearly prefer that instead of some decomposed normalised form!

I didn't say that we would have Normalization Form D (canonical
decomposition) on the Internet. What W3C is going to standardize on
is Normalization Form C (canonical composition).

For Unicode 3.0 (and ISO/IEC 10646-1:2000, which amounts to the same thing),
everything will be fine. Any combining character sequence that can
recompose canonically will do so, and any precomposed Latin accented
character in the standard will stay unchanged under that normalization
form.
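
To make that concrete, here is a small illustration using Python's
unicodedata module as a stand-in for any conformant normalizer (the
module is just a convenient way to show the behaviour, not part of
the specification):

    import unicodedata

    precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed  = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT

    # Under Normalization Form C both spellings come out precomposed.
    assert unicodedata.normalize("NFC", precomposed) == "\u00e9"
    assert unicodedata.normalize("NFC", decomposed)  == "\u00e9"

    # Under Normalization Form D both come out fully decomposed.
    assert unicodedata.normalize("NFD", precomposed) == "e\u0301"
    assert unicodedata.normalize("NFD", decomposed)  == "e\u0301"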

The problem is for any *additions* to the set of precomposed Latin
(or for that matter Cyrillic or other) accented characters in the
*future*. Please read the specification for Normalization Form C in
UTR #15 carefully. To obtain Normalization Form C you must first
decompose a character stream according to the *latest* UnicodeData.txt
list of decompositions, corresponding to the level of Unicode that
you support, and then *recompose* according to the UnicodeData-3.0.0.txt
corresponding to the just-released Unicode Version 3.0. The implication
of this is that in the future, if a precomposed Latin accented
character is added to the standard, it will decompose for Normalization
Form C but *not* recompose, since no recomposition rule for it will
be found in UnicodeData-3.0.0.txt.
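
Here is a rough sketch of that two-step process, with tiny made-up
tables standing in for the real UnicodeData.txt files (the data and
the hypothetical future precomposed character are invented purely
for illustration, and canonical ordering, blocking and Hangul are
all ignored):

    # DECOMPOSITIONS tracks the *latest* version of the standard;
    # COMPOSITIONS is frozen at the Unicode 3.0 data and never grows.
    DECOMPOSITIONS = {
        "\u00e9": "e\u0301",   # e-acute: in 3.0 and in every later version
        "X":      "c\u0323",   # "X" stands for a future precomposed c-underdot
    }
    COMPOSITIONS = {
        "e\u0301": "\u00e9",   # recomposition rule present in the 3.0 data
        # no entry for c + combining dot under: not precomposed in 3.0
    }

    def normalize_c(text):
        # Step 1: decompose with the latest decomposition data.
        decomposed = "".join(DECOMPOSITIONS.get(ch, ch) for ch in text)
        # Step 2: recompose, but only with the frozen 3.0 composition data.
        out, i = [], 0
        while i < len(decomposed):
            pair = decomposed[i:i + 2]
            if pair in COMPOSITIONS:
                out.append(COMPOSITIONS[pair])
                i += 2
            else:
                out.append(decomposed[i])
                i += 1
        return "".join(out)

    print(ascii(normalize_c("\u00e9")))   # '\xe9'      - recomposes
    print(ascii(normalize_c("X")))        # 'c\u0323'   - decomposes, stays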

This state of affairs may seem counterintuitive to you, but it is
required for normalization to work. Say we added, to pick a recently
mentioned example, c-underdot for Tamazight in Latin orthography.
You might expect that that should stay a single character under
Normalization Form C (i.e. decompose to c + combining dot under,
and then recompose to c-underdot). But the problem is that there
would not be any rule for recomposing to c-underdot, since the
precomposed character did not exist in Unicode 3.0. And the
combining character sequence c + combining dot under may already
exist in data to represent the intended text element. If the
c-underdot did not decompose to c + combining dot under, there would
be a failure of normalization: two sequences intended to be
identical under normalization would not actually be identical.

So why not update the version of the database used for the
recomposition rules? The problem with that is that adding any
new recomposition rules would have the effect of making any
pre-existing normalized data potentially ill-formed, since if you
reran the normalization algorithm over it and it had a sequence
impacted by the newly added recomposition rules, it would change
from what it was before. That situation would be absolutely
unacceptable for data used for digital signatures, for example.
Therefore, the version of the database used for *re*composition must
stay fixed.
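
You can already see half of that mechanism with the c-underdot
example: since no precomposed c-underdot exists, the combining
sequence simply has no recomposition rule and is left alone by
Normalization Form C (again using Python's unicodedata module
purely as an illustration):

    import unicodedata

    seq = "c\u0323"   # c + U+0323 COMBINING DOT BELOW
    # No precomposed form exists, so NFC leaves the sequence unchanged.
    assert unicodedata.normalize("NFC", seq) == seq

    # Contrast with a sequence that does have a Unicode 3.0 composition
    # rule: s + combining dot below recomposes to U+1E63.
    assert unicodedata.normalize("NFC", "s\u0323") == "\u1e63"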

If you have followed all of that, you should now see why there is no
sufficiently good reason to add any *more* precomposed Latin accented
characters to the standard. Under Normalization Form C (as well as
Normalization Form D) they are all going to decompose anyway. So
you have gained little, removed an available codepoint from the BMP
for nothing, and incrementally increased the table load that
everyone has to carry around to do the decompositions.

--Ken

>
> Dan
>
>


