Re: New contribution N2676

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 25 2003 - 05:11:45 CST


From: "Peter Kirk" <peterkirk@qaya.org>
> Have combining classes actually been defined for these characters?
>
> This is of course exactly the same problem as with Hebrew vowel points
> and accents, except that this time it applies to real living languages.
> Perhaps it is time to do something about these combining classes which
> conflict with the standard.

Do you mean officially documenting the correct (and strict) use of CGJ as
the only way to bypass the default order required by the combining classes
in normalized forms? It would be a good idea to document officially which
use of CGJ is superfluous and should be avoided in NF forms, and which use
is required.
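
As a rough illustration of how "required" versus "superfluous" could be
tested mechanically, here is a minimal Python sketch using the standard
unicodedata module. The function name cgj_affects_order is hypothetical, and
the Hebrew lamed/patah/hiriq sequences are only sample input. The check
looks at canonical ordering alone and ignores any other reason a CGJ might
be present.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER (combining class 0)

def cgj_affects_order(text: str) -> bool:
    """Return True if at least one CGJ in `text` blocks a canonical reordering.

    Hypothetical helper: normalize, strip every CGJ, and check whether
    normalization would now reorder the remaining combining marks.
    """
    normalized = unicodedata.normalize("NFD", text)
    stripped = normalized.replace(CGJ, "")
    return unicodedata.normalize("NFD", stripped) != stripped

# Sample input (an assumption for illustration): lamed with patah and hiriq.
required = "\u05dc\u05b7" + CGJ + "\u05b4"     # patah, CGJ, hiriq
superfluous = "\u05dc\u05b4" + CGJ + "\u05b7"  # hiriq, CGJ, patah

print(cgj_affects_order(required))
print(cgj_affects_order(superfluous))
```

In the first string the CGJ keeps patah ahead of hiriq; in the second the
marks are already in canonical order, so the CGJ changes nothing about the
ordering.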

1) This will affect only the input methods for those languages that need to
"swap" the standard order of combining characters to keep their logical
order (all this will require is an additional input control that will try
swapping ambiguous orders).

2) Complete documentation may need to specify which pairs of combining
characters are affected. It should list the pairs of combining characters
<c1, c2> where CC(c1) > CC(c2) and that must be encoded as <c1, CGJ, c2>
to be kept in logical order, since the sequence <c1, c2> will be reordered
into <c2, c1> in normalized forms.
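
The reordering described in point 2 can be observed with a short Python
sketch based on the standard unicodedata module. The choice of lamed as the
base, patah as c1 and hiriq as c2 is an assumption made for illustration;
any pair with CC(c1) > CC(c2) should behave the same way.

```python
import unicodedata

CGJ = "\u034f"   # COMBINING GRAPHEME JOINER
BASE = "\u05dc"  # HEBREW LETTER LAMED
C1 = "\u05b7"    # HEBREW POINT PATAH
C2 = "\u05b4"    # HEBREW POINT HIRIQ

def cc(ch: str) -> int:
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)

plain = BASE + C1 + C2         # <base, c1, c2>
joined = BASE + C1 + CGJ + C2  # <base, c1, CGJ, c2>

for form in ("NFC", "NFD"):
    reordered = unicodedata.normalize(form, plain)
    preserved = unicodedata.normalize(form, joined)
    # Without CGJ the marks are sorted by combining class: <base, c2, c1>.
    print(form, reordered == BASE + C2 + C1)
    # CGJ has combining class 0, so it blocks the reordering.
    print(form, preserved == joined)
```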

3) The other issue is that the sequence may contain other combining
characters besides those in such a pair.
Suppose I want to represent <base, c1, c2, c3>, where CC(c1) > CC(c2), but
c3 does not have a conflicting pair in the previous list. Should it be
encoded as <base, c1, CGJ, c2, c3> or as <base, c1, c3, CGJ, c2>? As the
standard normalization algorithm cannot be changed, both sequences will be
possible with the NF forms, even though they represent the same character.
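
To make the ambiguity in point 3 concrete, the following Python sketch
builds both candidate encodings and checks that each one survives
normalization unchanged while the two remain different strings. The
characters are assumptions chosen for illustration: lamed as the base, patah
as c1, hiriq as c2, and the accent etnahta as a c3 whose combining class is
higher than both.

```python
import unicodedata

CGJ = "\u034f"   # COMBINING GRAPHEME JOINER
BASE = "\u05dc"  # HEBREW LETTER LAMED
C1 = "\u05b7"    # HEBREW POINT PATAH
C2 = "\u05b4"    # HEBREW POINT HIRIQ
C3 = "\u0591"    # HEBREW ACCENT ETNAHTA

def is_stable(text: str) -> bool:
    """True if `text` is unchanged by both NFC and NFD."""
    return all(unicodedata.normalize(form, text) == text
               for form in ("NFC", "NFD"))

first = BASE + C1 + CGJ + C2 + C3   # <base, c1, CGJ, c2, c3>
second = BASE + C1 + C3 + CGJ + C2  # <base, c1, c3, CGJ, c2>

print(is_stable(first), is_stable(second))
# Two distinct normalized strings, so a binary comparison treats them as
# different text.
print(first != second)
```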

One could design an extra normalization step to force one interpretation (so
that only combining characters with conflicting combining classes that have
been forced "swapped" will appear after CGJ, all other diacritics being
encoded preferably in the first sequence before the CGJ).
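
A minimal sketch of what such an extra step might look like, in Python. The
function name prefer_before_cgj and the rule it applies are assumptions made
for illustration, not anything defined by Unicode: a mark following CGJ
stays there only if its combining class is lower than that of some mark
before the CGJ, and every other mark is moved in front of the CGJ. The
sketch assumes the input is already in a normalization form and leaves the
CGJ in place even when nothing remains after it.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

def cc(ch: str) -> int:
    return unicodedata.combining(ch)

def prefer_before_cgj(text: str) -> str:
    """Hypothetical extra step: keep only 'forced swapped' marks after CGJ."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != CGJ:
            out.append(text[i])
            i += 1
            continue
        # Marks (combining class != 0) emitted immediately before this CGJ.
        k = len(out)
        while k > 0 and cc(out[k - 1]) != 0:
            k -= 1
        prefix = out[k:]
        # Marks that directly follow this CGJ in the input.
        j = i + 1
        while j < n and cc(text[j]) != 0:
            j += 1
        suffix = text[i + 1:j]
        if prefix:
            highest = max(cc(c) for c in prefix)
            moved = [c for c in suffix if cc(c) >= highest]
            kept = [c for c in suffix if cc(c) < highest]
        else:
            # Nothing before the CGJ to conflict with: leave it alone.
            moved, kept = [], list(suffix)
        out.extend(moved)
        out.append(CGJ)
        out.extend(kept)
        i = j
    return "".join(out)

# Sample input (an assumption): lamed, patah, CGJ, hiriq, etnahta.
sample = "\u05dc\u05b7" + CGJ + "\u05b4\u0591"
result = prefer_before_cgj(sample)
print([f"U+{ord(c):04X}" for c in result])
```

With normalized input the output should itself stay normalized, because the
moved marks all have a combining class at least as high as the marks they
are appended to.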

This extra step should not be part of the NF forms (because Unicode states
that normalized forms will be kept normalized in all further versions of
Unicode), but this could be named differently, by describing a system in
which extra normalization steps may be applied that may change NF forms into
other "equivalent" sequences also in normalized form.


