Re: Yerushala(y)im - or Biblical Hebrew

From: Jonathan Coxhead
Date: Mon Jul 28 2003 - 21:25:53 EDT


       On 28 Jul 2003, at 16:49, Kenneth Whistler wrote:

    > Part of the specification of the Unicode normalization algorithm
    > is idempotency *across* versions, so that addition of new
    > characters to the standard, which require extensions of the
    > tables for decomposition, recomposition, and composition
    > exclusion in the algorithm, does *not* result in a situation
    > where application of a later version of the normalization algorithm
    > results in change of *any* string normalized by an earlier version
    > of the algorithm.
    > The suggested changes in combining class values would break *that*
    > specification.

       Is this really the case? It seems to me that if 2 letters that (in an
    earlier version of Unicode) had different combining classes were changed (by a
    later version) to have the same combining class, it would still be backwards
    compatible. The effect is the same as if the normalisation had not been done,
    and the principle of "be conservative in what you generate, but liberal in what
    you accept" means that no-one should be assuming that content which they
    receive has been normalised.

       In other words, if you receive i-a in Hebrew, you may deduce that it is not
    normalised, and normalise it yourself; and you have to do that anyway, so there
    is no loss.
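
For concreteness, here is a small sketch of what the current tables do to the vowel order under discussion. It assumes that "i" and "a" above refer to HEBREW POINT HIRIQ and HEBREW POINT PATAH, and it uses a lamed as the base letter; both choices are assumptions made for illustration, not something stated in the message. It relies only on Python's standard `unicodedata` module.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED, a base letter (combining class 0)
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, assumed to be the "i" vowel
PATAH = "\u05B7"   # HEBREW POINT PATAH, assumed to be the "a" vowel

# Combining classes as recorded in the Unicode Character Database.
print(unicodedata.combining(HIRIQ), unicodedata.combining(PATAH))

# Marks typed in the order patah, hiriq after the lamed.
written = LAMED + PATAH + HIRIQ
normalised = unicodedata.normalize("NFC", written)

# Canonical reordering sorts the marks by class, so the typed order is lost.
print([unicodedata.name(ch) for ch in normalised])

# A receiver that normalises again gets the same string back.
assert unicodedata.normalize("NFC", normalised) == normalised
```

Because the two points have different classes, both typed orders end up as the same normalised string, which is the loss of information the thread is concerned with.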

       If what I'm saying is true, then it is always possible for new versions of
    Unicode to change combining classes, as long as the following rule is observed:

          ---any 2 distinct character sequences which map to 2 distinct normalised
                sequences must always do so, but

          ---if 2 distinct character sequences map to the same normalised character
                 sequence in an earlier version of Unicode, they may map to
                 distinct sequences in a later version.

       (Or, in other words, information that was retained must not be lost, but
    just because information was discarded by an earlier version does not mean that
    it will always be discarded.)
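
The rule above can be checked by brute force on a toy alphabet. The sketch below is only an illustration: `reorder` is a simplified stand-in for canonical reordering (no decomposition or composition), and the characters `b`, `X`, `Y` and their class values are hypothetical.

```python
from itertools import product

def reorder(seq, ccc):
    """Simplified canonical reordering: stably sort each run of marks
    (nonzero class) by class; class-0 characters stay where they are."""
    out, run = [], []
    for ch in seq:
        if ccc.get(ch, 0) == 0:
            out.extend(sorted(run, key=ccc.__getitem__))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc.__getitem__))
    return "".join(out)

def obeys_rule(old, new, alphabet, max_len=3):
    """True if no two sequences that `old` keeps distinct are
    mapped to the same normalised sequence by `new`."""
    seqs = ["".join(p) for n in range(1, max_len + 1)
            for p in product(alphabet, repeat=n)]
    return not any(
        reorder(s, old) != reorder(t, old) and reorder(s, new) == reorder(t, new)
        for s in seqs for t in seqs
    )

# Hypothetical tables: "b" is a base character (class 0, not listed).
distinct_classes = {"X": 1, "Y": 2}
same_class = {"X": 1, "Y": 1}

merging_ok = obeys_rule(distinct_classes, same_class, "bXY")
splitting_ok = obeys_rule(same_class, distinct_classes, "bXY")
print(merging_ok, splitting_ok)
```

In this toy model, going from distinct classes to a shared class passes the check, while going from a shared class to distinct classes fails it, since XY and YX were kept apart before and are folded together afterwards.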

       As new characters are encoded in Unicode, *backwards* compatibility is
    assured, but not forwards. If your application assumes that an unencoded code
    point will remain unencoded for all time, then eventually it will get an
    unpleasant shock. This is OK because we know certain kinds of change are
    allowed. It is just this reasoning, applied to combining classes, that lets us
    conclude that *merging* classes is allowed, but that if 2 characters have the
    same class, they must have the same class forever.

       This implies that if characters X and Y, with combining classes A and B,
    have a semantic difference between XY and YX which we discover only belatedly,
    then we may set the combining classes of both of them to some value C between A
    and B (it doesn't matter which value we pick, as long as A <= C <= B), BUT we
    must also set the combining classes of *all other characters* with a class D
    that lies in the range A <= D <= B to C also.
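
One way to see why the classes in between have to move as well is to test whether text normalised under the old table is still in normal form under the new one. The sketch below does that for strings made up of marks only (a single run, so reordering is a plain stable sort). The names `merge_range` and `old_forms_survive`, the characters X, Z, Y and their class values are all hypothetical, and this is one reading of the requirement rather than a statement of the reasoning behind it.

```python
from itertools import product

def reorder(marks, ccc):
    """Stable sort of a run of combining marks by combining class."""
    return "".join(sorted(marks, key=ccc.__getitem__))

def merge_range(ccc, lo, hi, target):
    """Set every class D with lo <= D <= hi to `target`."""
    return {ch: (target if lo <= c <= hi else c) for ch, c in ccc.items()}

def old_forms_survive(old, new, max_len=4):
    """True if every string normalised under `old` is unchanged by `new`."""
    for n in range(1, max_len + 1):
        for p in product(old, repeat=n):
            s = reorder("".join(p), old)
            if reorder(s, new) != s:
                return False
    return True

# Hypothetical table: X has class A = 1, Y has class B = 3, Z sits between.
old = {"X": 1, "Z": 2, "Y": 3}

# Merge the whole range A..B to C = 1, as described above.
full = merge_range(old, 1, 3, 1)

# Move only X and Y to C = 1 and leave Z at its old class.
partial = dict(old, X=1, Y=1)

full_ok = old_forms_survive(old, full)
partial_ok = old_forms_survive(old, partial)
print(full_ok, partial_ok)
```

With the partial change, the old normal form XZY is reordered to XYZ by the new table, so previously normalised data would no longer be stable; with the full merge, every old normal form is left alone.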

       I don't see why anyone who accepts that Unicode is an extensible character
    set could object to such a change. And luckily, it's just what would solve the
    Hebrew normalisation problem.

     . . . (_|/ o n a t h a n
