From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 15 2005 - 20:12:32 CST
From: "Peter Kirk" <peterkirk@qaya.org>
> Elaine, the good news for you is that if you order your Unicode Hebrew
> text according to these 'alternative combining classes' you will not be
> deviating at all from the Unicode standard. Your text will not be
> normalised in any of the standard normalisation forms, but the standard
> nowhere specifies that texts must be normalised. Of course you need to
> ensure that your text is not normalised by other processes, or that if it
> is you then restore it to the order of the 'alternative combining
> classes' - a process which should be reversible.
Note that you can't define "alternative combining classes" the way you want,
if you need to preserve canonical equivalence.
Notably:
(1) you can't change a non-zero combining class into a zero combining class
(or the reverse): starters must remain starters, and non-starters in
combining sequences must remain non-starters.
(2) if you change the combining class of some character without also
changing it for the other combining characters in the same class, you may
break canonical equivalence: characters in the two new classes could be
reordered freely relative to each other, whereas characters sharing the same
standard class would have remained in their relative order. One example:
changing the combining class of the upper-right form of the cedilla to match
its special positioning on some letters, without also changing the combining
class of all other diacritics attached below, will break canonical
equivalence.
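The two behaviours that point (2) relies on can be checked with a short
sketch. It assumes Python's standard unicodedata module; the constant and
variable names are only illustrative. Marks from different classes are
reordered by normalization, so both input orders are equivalent, while marks
sharing a class keep their relative order, so the two orders stay distinct.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
GRAVE = "\u0300"      # COMBINING GRAVE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

# Different classes: both orders normalize to the same string.
different_classes_equivalent = (
    nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE)
)

# Same class: relative order is preserved, so the two orders stay distinct.
same_class_equivalent = (
    nfd("a" + ACUTE + GRAVE) == nfd("a" + GRAVE + ACUTE)
)

print(different_classes_equivalent)  # True
print(same_class_equivalent)         # False
```

Moving only one of the two class-230 marks into a class of its own would
make the second pair reorderable, and so merge two strings that the standard
keeps distinct.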
In other words: all Unicode characters live in a strictly partitioned space,
made of distinct sets of characters sharing the same combining class.
These subsets are immutable (you can't move a character from one subset to
another without breaking canonical equivalence).
These subsets are numbered quite arbitrarily (from 0 to 255), and the
absolute or even relative value of this number has no importance (except
for class 0, and for the normalization forms, where only the relative order
of non-0 combining classes matters).
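A rough sketch of why only the relative order of the non-zero classes
matters. It assumes Python's unicodedata module; reorder(), renumbered() and
split_class() are hypothetical helpers written for this illustration, not
part of any library, and reorder() is a simplified reordering step that only
handles already-decomposed text.

```python
import unicodedata

def reorder(s, ccc):
    """Stable-sort each run of non-starters by the class function ccc."""
    out, run = [], []
    for ch in s:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

standard = unicodedata.combining

def renumbered(ch):
    # Hypothetical renumbering: order-preserving, and class 0 stays 0.
    c = standard(ch)
    return c * 2 + 1 if c else 0

def split_class(ch):
    # Hypothetical table that moves only COMBINING GRAVE out of class 230.
    return 100 if ch == "\u0300" else standard(ch)

# Already-decomposed samples: Latin marks and Hebrew lamed + patah + hiriq.
samples = ["a\u0301\u0323", "a\u0301\u0300", "\u05dc\u05b7\u05b4"]

matches_nfd = all(
    reorder(s, standard) == unicodedata.normalize("NFD", s) for s in samples
)
same_as_standard = all(
    reorder(s, renumbered) == reorder(s, standard) for s in samples
)
split_changes_result = (
    reorder("a\u0301\u0300", split_class) != reorder("a\u0301\u0300", standard)
)
```

On these samples the order-preserving renumbering gives exactly the standard
result, while taking a single character out of its subset changes the
outcome for a string the standard leaves untouched.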
For these reasons, there's no way to have the combining class values match
all actual positioning interactions of combining characters. The "names"
assigned to combining classes are not accurate in all cases and represent
just an approximation. If the relative order of two diacritics in a
combining sequence is important, they must either share the same combining
class, or be separated by a character of combining class 0 that blocks
reordering (like CGJ, or ZWJ and ZWNJ).
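A small sketch of the blocking effect, again assuming Python's unicodedata
module, with variable names that are only illustrative. The Hebrew points
patah (class 17) and hiriq (class 14) are swapped by normalization unless a
CGJ, which has combining class 0, sits between them.

```python
import unicodedata

LAMED = "\u05dc"  # HEBREW LETTER LAMED
PATAH = "\u05b7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05b4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034f"    # COMBINING GRAPHEME JOINER, combining class 0

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + CGJ + HIRIQ

reordered = nfd(plain)    # hiriq (14) is sorted before patah (17)
preserved = nfd(guarded)  # nothing is reordered across the class-0 CGJ

print(reordered == LAMED + HIRIQ + PATAH)  # True
print(preserved == guarded)                # True
```

The guarded string is not canonically equivalent to the plain one, so any
process comparing the two has to account for the inserted CGJ.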
Also the restriction on combining class 0 means that there's no way to
transform a single grapheme cluster encoded by two successive but separate
combining sequences, into a single combining sequence (this will be
important for most Indian and South-East Asian scripts, with some known
interactions between combining sequences, notably those with VIRAMA-like
characters).
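As an illustration of that restriction, here is a sketch assuming Python's
unicodedata module. starter_segments() is a hypothetical helper, and its
notion of a combining sequence (a new segment starts at every class-0
character that is not a mark) is a simplification of the standard's
definition. A Devanagari conjunct that is usually drawn as one unit remains
two sequences under every normalization form.

```python
import unicodedata

def starter_segments(s):
    """Split s so that each segment begins at a class-0 non-mark character."""
    segments = []
    for ch in s:
        is_base = (
            unicodedata.combining(ch) == 0
            and not unicodedata.category(ch).startswith("M")
        )
        if is_base or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments

KA = "\u0915"      # DEVANAGARI LETTER KA, combining class 0
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, combining class 9
SSA = "\u0937"     # DEVANAGARI LETTER SSA, combining class 0

conjunct = KA + VIRAMA + SSA  # usually rendered as a single conjunct

segments = starter_segments(conjunct)
unchanged = all(
    unicodedata.normalize(form, conjunct) == conjunct
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(len(segments))  # 2
print(unchanged)      # True
```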
Correct processing of text cannot depend only on combining sequences. So the
impact of an "incorrect" relative order of combining classes is very small,
given that this is not the proper level of abstraction to handle these cases.
So if you need to change combining classes into custom ones for rendering
purposes, you will do that as part of the processing that transforms a
string from logical to physical order. This may create identical results
from strings that are initially canonically different, and users won't be
able to see any difference when they look at characters at the grapheme
cluster level!
So, the concepts of canonical equivalence and combining classes are not
suited to linguistic analysis, and they are not even enough for some
security-related encodings (where the better concept to use would be the
collation of strings). They are just a simplification of a more complex
case, one that (sometimes) reduces the number of possible encodings of the
logically "same" string, and the number of strings that should be
recognized. Unfortunately, this does not reduce that number to 1 and only 1
(but this can be mitigated by orthographic conventions applied to encoded
texts, such as suggested in UTN #19).