Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 28 2003 - 22:27:24 EDT

    > On 28 Jul 2003, at 16:49, Kenneth Whistler wrote:
    >
    > > Part of the specification of the Unicode normalization algorithm
    > > is idempotency *across* versions, so that addition of new
    > > characters to the standard, which require extensions of the
    > > tables for decomposition, recomposition, and composition
    > > exclusion in the algorithm, does *not* result in a situation
    > > where application of a later version of the normalization algorithm
    > > results in change of *any* string normalized by an earlier version
    > > of the algorithm.
    > >
    > > The suggested changes in combining class values would break *that*
    > > specification.
    >
    > Is this really the case?

    To the extent that the suggested changes advocate for reversing values,
    it is the case.
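
    A minimal sketch of the reversal case, assuming two hypothetical
    combining marks x and y and modelling only the canonical reordering
    step of normalization (no decomposition or composition). The names
    nfd_reorder, V_N and V_REVERSED are illustrative, not part of any
    Unicode API:

```python
def nfd_reorder(seq, ccc):
    """Canonical reordering only: stable-sort each run of non-starters by class."""
    out, run = [], []
    for ch in list(seq) + [None]:  # None is a sentinel that flushes the last run
        if ch is not None and ccc.get(ch, 0) != 0:
            run.append(ch)
            continue
        out.extend(sorted(run, key=ccc.__getitem__))
        run = []
        if ch is not None:
            out.append(ch)
    return out

# Hypothetical class tables: version n, and a later version with the values reversed.
V_N = {"x": 1, "y": 2}
V_REVERSED = {"x": 2, "y": 1}

normalized_n = nfd_reorder(["b", "y", "x"], V_N)       # ['b', 'x', 'y']
renormalized = nfd_reorder(normalized_n, V_REVERSED)   # ['b', 'y', 'x']
```

    Under these assumptions the string that was already normalized by
    version n is changed by the later version, which is the break in
    cross-version idempotency described above.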

    > It seems to me that if 2 letters that (in an
    > earlier version of Unicode) had different combining classes were changed (by a
    > later version) to have the same combining class, it would still be backwards
    > compatible. The effect is the same as if the normalisation had not been done,
    > and the principle of "be conservative in what you generate, but liberal in what
    > you accept" means that no-one should be assuming that content which they
    > receive has been normalised.

    Merging classes runs afoul of a different requirement, stability of
    canonical equivalence, rather than idempotency.

    Basically, if we have two combining marks, x and y, with combining
    classes 1 and 2, respectively, then for version n, we have:

    <b, x(1), y(2)>
    <b, y(2), x(1)> --NFD--> <b, x(1), y(2)>

    so the sequences <b, x, y> and <b, y, x> are canonically equivalent.

    For version n+1, where the combining class of y is changed to 1, we have:

    <b, x(1), y(1)>
    <b, y(1), x(1)>

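    Neither sequence is reordered by NFD, because marks with the same
    combining class keep their relative order, so the sequences
    <b, x, y> and <b, y, x> are *not* canonically equivalent.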

    So while it is true that application of Version n+1 normalization to
    the Version n normalized string <b, x, y> would not *change* the string,
    the problem is that under Version n+1 normalization the determination
    of canonical equivalence has changed from yes to no.
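
    The x/y example can be sketched in a few lines, assuming a toy model
    in which normalization consists only of canonical reordering and the
    class tables are plain dictionaries. The helper names and the tables
    CCC_N and CCC_N1 are hypothetical stand-ins for versions n and n+1:

```python
def canonical_reorder(seq, ccc):
    """Stable-sort each run of marks with a non-zero combining class."""
    out, run = [], []
    for ch in seq:
        if ccc.get(ch, 0) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=lambda c: ccc[c]))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: ccc[c]))
    return out

def canonically_equivalent(s, t, ccc):
    """In this toy model, two sequences are equivalent if they reorder alike."""
    return canonical_reorder(s, ccc) == canonical_reorder(t, ccc)

CCC_N = {"x": 1, "y": 2}    # version n
CCC_N1 = {"x": 1, "y": 1}   # version n+1, with y merged into class 1

s = ["b", "x", "y"]
t = ["b", "y", "x"]

print(canonically_equivalent(s, t, CCC_N))    # True
print(canonically_equivalent(s, t, CCC_N1))   # False
print(canonical_reorder(s, CCC_N1) == s)      # True: the normalized string is unchanged
```

    The last line shows why idempotency alone does not catch the problem:
    the normalized string survives, but the equivalence judgment flips.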

    It is the combination of these two requirements, idempotency and
    stability of canonical equivalence, that explains the wording of
    the stability guarantees, including 3f:

      "the order relation (greater than, equal to, or less than) of the
       canonical combining classes of any two characters will never change"
       
    The problem is not in assuming the wrong thing about the normalization
    status of the data you process, but rather that the determination of
    canonical equivalence of the same two strings would change between
    versions.
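
    One way to picture the quoted guarantee is as a check over two class
    tables. This is only a sketch under the assumption that each table
    is a dictionary from character to combining class; order_violations
    and the sample tables are hypothetical, not taken from the standard:

```python
from itertools import combinations

def sign(n):
    """Return -1, 0 or 1 according to the sign of n."""
    return (n > 0) - (n < 0)

def order_violations(old, new):
    """List pairs whose class order relation differs between two tables."""
    shared = sorted(set(old) & set(new))
    return [
        (a, b)
        for a, b in combinations(shared, 2)
        if sign(old[a] - old[b]) != sign(new[a] - new[b])
    ]

OLD = {"x": 1, "y": 2}
RENUMBERED = {"x": 10, "y": 20}  # values change, order relation preserved
MERGED = {"x": 1, "y": 1}        # "less than" becomes "equal to"
REVERSED = {"x": 2, "y": 1}      # "less than" becomes "greater than"

print(order_violations(OLD, RENUMBERED))  # []
print(order_violations(OLD, MERGED))      # [('x', 'y')]
print(order_violations(OLD, REVERSED))    # [('x', 'y')]
```

    In this sketch both merging and reversing are reported, which matches
    the wording of 3f: the relation itself, including "equal to", is what
    must not change.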

    > If what I'm saying is true, then it is always possible for new versions of
    > Unicode to change combining classes, as long as the following rule is observed:
    >
    > ---any 2 distinct character sequences which map to 2 distinct normalised
    > sequences must always do so, but
    >
    > ---if 2 distinct character sequences map to the same normalised character
    > sequence in an earlier version of Unicode, they may map to
    > distinct sequences in a later version.

    Nope. Not allowed.

    > As new characters are encoded in Unicode, *backwards* compatibility is
    > assured, but not forwards. If your application assumes that an unencoded code
    > point will remain unencoded for all time, then eventually it will get an
    > unpleasant shock. This is OK because we know certain kinds of change are
    > allowed. It is just this reasoning, applied to combining classes, that lets us
    > conclude that *merging* classes is allowed, but that if 2 characters have the
    > same class, they must have the same class forever.

    There are all kinds of possible scenarios where things would be "broken"
    if something adjudged to be the *same* now, under the canonical
    equivalence rules, suddenly becomes *different* under a future set
    of rules. This kind of interoperability problem is ruled out by
    the specification in the standard.

    > This implies that if characters X and Y, with combining classes A and B,
    > have a semantic difference between XY and YX which we discover only belatedly,
    > then we may set the combining classes of both of them to some value C between A
    > and B (it doesn't matter which value we pick, as long as A <= C <= B), BUT we
    > must also set the combining classes of *all other characters* with a class D
    > that lies in the range A <= D <= B to C also.
    >
    > I don't see why anyone who accepts that Unicode is an extensible character
    > set could object to such a change. And luckily, it's just what would solve the
    > Hebrew normalisation problem.

    Well, you've just got my objection, which I think reflects the reasoning
    of the UTC regarding how this has to be.

    --Ken


