From: Jonathan Coxhead (firstname.lastname@example.org)
Date: Mon Jul 28 2003 - 21:25:53 EDT
On 28 Jul 2003, at 16:49, Kenneth Whistler wrote:
> Part of the specification of the Unicode normalization algorithm
> is idempotency *across* versions, so that addition of new
> characters to the standard, which require extensions of the
> tables for decomposition, recomposition, and composition
> exclusion in the algorithm, does *not* result in a situation
> where application of a later version of the normalization algorithm
> results in change of *any* string normalized by an earlier version
> of the algorithm.
> The suggested changes in combining class values would break *that*
Is this really the case? It seems to me that if 2 letters that had different
combining classes in an earlier version of Unicode were changed by a later
version to have the same combining class, it would still be backwards
compatible. The effect is the same as if the normalisation had not been done,
and the principle of "be conservative in what you generate, but liberal in what
you accept" means that no-one should be assuming that content which they
receive has been normalised.
In other words, if you receive i-a in Hebrew, you may deduce that it is not
normalised, and normalise it yourself; and you have to do that anyway, so there
is no loss.
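As a concrete sketch of "normalise it yourself" (my own example, using Python's standard unicodedata module; the choice of points is illustrative), canonical ordering sorts combining marks by ascending combining class, so marks received in the other order are simply put back into canonical order on renormalisation:

```python
import unicodedata

alef  = "\u05D0"   # HEBREW LETTER ALEF (base character, combining class 0)
hiriq = "\u05B4"   # HEBREW POINT HIRIQ, combining class 14
patah = "\u05B7"   # HEBREW POINT PATAH, combining class 17

print(unicodedata.combining(hiriq))   # 14
print(unicodedata.combining(patah))   # 17

# Received with patah before hiriq, i.e. not in canonical order:
received = alef + patah + hiriq

# Renormalising on receipt reorders the marks by ascending combining class:
renormalised = unicodedata.normalize("NFC", received)
print(renormalised == alef + hiriq + patah)   # True
```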
If what I'm saying is true, then it is always possible for new versions of
Unicode to change combining classes, as long as the following rule is observed:
---any 2 distinct character sequences which map to 2 distinct normalised
sequences must continue to do so in every later version, but
---if 2 distinct character sequences map to the same normalised character
sequence in an earlier version of Unicode, they may map to
distinct sequences in a later version.
(Or, in other words, information that was retained must not be lost, but
just because information was discarded by an earlier version does not mean that
it will always be discarded.)
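This can be seen in a toy model of canonical ordering (a stable sort of marks by combining class). The class tables below are made up for illustration, not Unicode's real values: merging 2 classes means sequences that used to be reordered into the same form now keep their received order, so previously-discarded information stops being discarded, while no 2 previously-distinct results collapse together:

```python
# Toy model of canonical ordering: a stable sort of combining marks by
# combining class. The class tables are hypothetical, not Unicode's.
def canonical_order(marks, ccc):
    return "".join(sorted(marks, key=lambda m: ccc[m]))  # sorted() is stable

old = {"X": 10, "Y": 20}   # distinct classes: XY and YX both normalise to XY
new = {"X": 15, "Y": 15}   # classes merged: received order is preserved

print(canonical_order("YX", old))   # XY  -- information discarded
print(canonical_order("YX", new))   # YX  -- formerly-merged sequences now distinct
print(canonical_order("XY", old))   # XY
print(canonical_order("XY", new))   # XY  -- already-distinct results stay distinct
```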
As new characters are encoded in Unicode, *backwards* compatibility is
assured, but not forwards. If your application assumes that an unencoded code
point will remain unencoded for all time, then eventually it will get an
unpleasant shock. This is OK because we know certain kinds of change are
allowed. It is just this reasoning, applied to combining classes, that lets us
conclude that *merging* classes is allowed, but that if 2 characters have the
same class, they must have the same class forever.
This implies that if characters X and Y, with combining classes A and B,
turn out to have a semantic difference between XY and YX which we discover only
belatedly, then we may set the combining classes of both of them to some value C
between A and B (it doesn't matter which value we pick, as long as A <= C <= B),
BUT we must also set to C the combining class of *every other character* whose
class D lies in the range A <= D <= B.
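A toy model of canonical ordering (a stable sort by combining class, with made-up class values) shows why the constraint A <= C <= B matters: because C stays inside the merged band, characters whose classes lie outside the band keep their order relative to the merged characters, while order inside the band is now preserved rather than imposed:

```python
# Toy model of canonical ordering: a stable sort of combining marks by
# combining class. The class tables are hypothetical, not Unicode's.
def canonical_order(marks, ccc):
    return "".join(sorted(marks, key=lambda m: ccc[m]))  # sorted() is stable

# X, Z, Y occupy the band [A, B] = [10, 20]; P and Q lie outside it.
old = {"P": 5, "X": 10, "Z": 15, "Y": 20, "Q": 30}
# The whole band is merged to C = 15, with A <= C <= B.
new = {"P": 5, "X": 15, "Z": 15, "Y": 15, "Q": 30}

# Ordering relative to characters outside the band is unchanged:
print(canonical_order("QXP", old))   # PXQ
print(canonical_order("QXP", new))   # PXQ

# Inside the band, received order is now preserved rather than imposed:
print(canonical_order("YZX", old))   # XZY
print(canonical_order("YZX", new))   # YZX
```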
I don't see why anyone who accepts that Unicode is an extensible character
set could object to such a change. And luckily, it's just what would solve the
Hebrew normalisation problem.
. . . (_|/ o n a t h a n
This archive was generated by hypermail 2.1.5 : Mon Jul 28 2003 - 22:00:05 EDT