# Re: Possible problem going forward with normalization

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Mon Dec 27 1999 - 01:13:17 EST

Martin has very good advice here. Notice also an important feature of
normalization:

Suppose NewSystem normalizes some data containing character X, and
sends that data to OldSystem, where X is undefined. Since X is undefined, it
should be given a canonical combining class of zero. Then any normalization
on OldSystem with an old version of Unicode will leave X with the same
ordering relative to other characters. For example, suppose we have:

A, dot_under, X, grave.

On NewSystem, X could have either a canonical combining class of zero, or
anything from 220 to 230. [See
"ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html#Canonical Combining
Classes" for a list of the canonical classes.] In either case the sequence
above is already in proper canonical order. On OldSystem, if X is given the
value zero, then any canonical rearrangement will still leave the characters
in correct canonical order on both systems. Of course, the probability is
very high that any new character will have a canonical combining class of 0
in any event.

The only problem would occur if a program on OldSystem altered the text. If
it inserted a new accent -- acute -- then it does make a difference what the
canonical combining class of X was. OldSystem would think that the following
was in canonical order:

A, dot_under, acute, X, grave.

while if X had the value 222 (for example) on NewSystem, then the proper
order would be:

A, dot_under, X, acute, grave.
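
To make the two cases concrete, here is a minimal sketch of canonical reordering in Python. It is not a real normalizer: the character names are just labels, the OLD and NEW tables are made up for this example, and the class 222 for X is the hypothetical value used above. Characters missing from a table are treated as class zero, as suggested for OldSystem.

```python
def canonical_order(seq, ccc):
    """Stably sort each run of non-zero-class characters by combining class."""
    out, run = [], []
    for ch in seq:
        cls = ccc.get(ch, 0)  # undefined characters are treated as class 0
        if cls == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=lambda c: ccc[c]))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: ccc[c]))
    return out

# Illustrative tables only: OldSystem does not know X at all.
OLD = {"A": 0, "dot_under": 220, "acute": 230, "grave": 230}
NEW = dict(OLD, X=222)  # hypothetical class for X on NewSystem

original = ["A", "dot_under", "X", "grave"]
edited = ["A", "dot_under", "acute", "X", "grave"]  # acute inserted on OldSystem

# The untouched text is stable under both tables.
print(canonical_order(original, OLD) == canonical_order(original, NEW))
# The edited text is left alone by OldSystem but reordered by NewSystem.
print(canonical_order(edited, OLD))
print(canonical_order(edited, NEW))
```

With these tables the unedited sequence comes out the same on both systems, while the edited one is only moved by NewSystem, which places X before acute.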

The only way we could make this absolutely work across new versions is:

a) define ranges of unassigned characters for particular combining classes,
and only assign non-zero combining class characters in those ranges.
b) restrict the number of canonical ordering classes to the current set of
values. (This is not strictly necessary, but makes (a) easier.)
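
As a rough illustration of (a), an old implementation could fall back on reserved ranges when it meets a code point it has no data for. The sketch below is in Python; the ranges and the function name are invented for the example and do not correspond to any actual Unicode allocation.

```python
# Hypothetical reserved ranges: (code point range, combining class).
# These numbers are made up purely for illustration.
RESERVED_RANGES = [
    (range(0xF0000, 0xF0100), 220),
    (range(0xF0100, 0xF0200), 230),
]

def combining_class(cp, known):
    """Return the class from the known table, else the reserved-range default."""
    if cp in known:
        return known[cp]
    for rng, cls in RESERVED_RANGES:
        if cp in rng:
            return cls  # unassigned here, but its future class is fixed
    return 0  # anything else that is unknown is treated as class 0

known = {0x0300: 230, 0x0301: 230, 0x0323: 220}
print(combining_class(0x0323, known))
print(combining_class(0xF0005, known))
print(combining_class(0x4E00, known))
```

Under such a scheme OldSystem would order a future mark the same way NewSystem does, without needing updated tables.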

Mark

"Martin J. Duerst" wrote:

> At 16:03 1999/12/21 -0800, John Cowan wrote:
> > It occurs to me that when a future version of Unicode is released with
> > new combining marks, text that mixes old and new marks on the same
> > base character, or that generates incorrectly ordered new marks,
> > will produce inconsistent results when passed through normalization.
> >
> > Consider the sequence LATIN SMALL LETTER A, COMBINING GRACKLE (a
> > post-3.0 character of class 232), COMBINING ACUTE ACCENT.
> > A 3.0 implementation of normalization will not recognize the
> > COMBINING GRACKLE as a mark, and will not swap it with the
> > acute mark. A post-3.0 implementation with updated tables
> > will do so.
> >
> > What is the current thinking on this?
>
> Ken has given all the details. They show that the problems
> that indeed can appear can be minimized by being careful
> when introducing new things post Unicode 3.0. The whole
> idea of normalization and the exact details of each
> form, in particular normalization form C, were carefully
> considered to reduce as much as possible the impact of
> new introductions. Of course, everybody knew that it would
> not be possible to reduce this impact to zero.
>
> One more thing that is very important to consider is where
> this normalization should be applied. The W3C character
> model (http://www.w3.org/TR/charmod) very clearly says that
> normalization should be applied as early as possible. This
> has a very strong reason: The closer to the 'origin' of a
> character, the higher the chance that information about
> that character will be around, and that therefore normalization
> will be done correctly.
>
> Translated to our examples, what we really should consider
> is two editors Editix and Ny-Editix (and not two normalization
> programs Normix and Ny-Normix). Editix will allow you to create/
> edit text in Unicode 3.0, Ny-Editix in Unicode 4.0. Both of them
> may use whatever representation they like internally, but
> externally, Editix will use Unicode 3.0 in Normal Form C,
> and Ny-Editix will use Unicode 4.0 in Normal Form C.
> Editix does not allow you to create characters new in Unicode
> 4.0, and therefore a Unicode 3.0-based Normal Form C is
> all that is needed.
>
> Of course, there is the question of what's the real origin
> of a character. Rather than the editor, this may be the
> keyboard driver. Where keyboard drivers generate the
> relevant characters, they should also make sure they are
> appropriately normalized.
>
> So the general idea is not 'everybody normalize every time
> they see some data', but 'normalize early, don't let
> unnormalized data show up at all'.
>
> Regards, Martin.
>
> #-#-# Martin J. Du"rst, World Wide Web Consortium
> #-#-# mailto:duerst@w3.org http://www.w3.org

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:57 EDT