Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 28 2003 - 21:34:50 EDT

    > After reading through some of the archives (some pointers to the relevant
    > parts would be helpful, please--something beyond "consult the archives"), it
    > strikes me that normalization, with its severe requirements, is going to
    > eventually so distort Unicode that it will render it nearly unusable.
    > Consider the thread that starts at
    > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML020/0651.html
    > (from 1999, for goodness sake!):

    Yep, good pointer. And, in particular, my reply on December 21, 1999:

    http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML020/0655.html

    (unicode-ml:unicode for people who have trouble clicking through)

    which laid out the encoding consequences going forward. Mark Davis
    and I were campaigning hard in 1999 to ensure that everyone went
    into the home stretch for Unicode 3.0 with "eyes wide open", so that
    it was clear what normalization meant after Unicode 3.0 was
    published.

    > Normalization will ossify Unicode: it will
    > become harder and harder to accept new, clean encodings. This is truly going
    > to become the tail that wags the dog.

    I concur that normalization will ossify Unicode -- in part. It has
    clearly established sharp constraints: on the freedom of the
    committees to introduce characters that would have certain kinds of
    equivalence relations to existing characters, on the freedom of the
    UTC to modify established decompositions, and on the freedom of the
    UTC to modify combining classes -- which is where this entire
    discussion began.

    However, I am rather more sanguine regarding other aspects of
    the standard going forward. Nothing about normalization has
    prevented progress on the reasonable encoding of additional
    scripts, nor the addition of many thousands of characters for
    existing scripts (e.g. Han) or for symbols. In that respect it
    is not ossified at all.

    And the *goal* for encoding a new script is to encode it in such
    a way that normalization is *not* an issue for it. An
    ideal encoding for a script usually has a fairly self-evident
    representation for any particular piece of text, and that
    self-evident representation is the *only* such representation.

    The worst normalization problems -- by far -- for Unicode are
    found in the scripts (such as Latin, Greek, and Hangul) which
    have long histories of legacy implementations prior to Unicode --
    histories which got reflected into the standard via multiple
    sets of compatibility characters, for example.
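    A quick sketch of the Latin-script situation, using Python's
    standard `unicodedata` module (the characters involved are from the
    Unicode standard itself; the code is just one way to demonstrate
    them):

```python
import unicodedata

# "e-acute" can be stored precomposed (U+00E9) or decomposed
# (U+0065 U+0301) -- two distinct code point sequences for one letter.
precomposed = "\u00e9"
decomposed = "e\u0301"
assert precomposed != decomposed

# Normalization maps between the two representations:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Legacy practice added still more variants, e.g. the Angstrom sign
# U+212B, which canonically normalizes to the letter U+00C5:
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"
```

    Every such duplicate encoding is frozen into the normalization
    tables forever, which is exactly the legacy burden described above.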

    > My prediction: normalization will eventually force some sort of version
    > indicator to be included in all (normalized) Unicode text. (Weak analogy:
    > much as DTD references are, either explicitly or implicitly, part of all XML
    > documents).

    I doubt it. I think it is much more likely that the stability of
    normalization per se will hold. And when people finally come to understand
    that Unicode normalization forms don't meet all of their
    string equivalencing needs, the pressure will grow to define other
    kinds of equivalences. And that will be addressed through other
    mechanisms, the seed for which is already being discussed in:

    http://www.unicode.org/reports/tr30/

    But that document obviously needs a lot more work in committee
    before it is complete.
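    The gap between canonical normalization and what applications
    actually want from string matching is easy to exhibit. A sketch in
    Python (case folding stands in here for the broader family of
    foldings that UTR #30 was exploring; it is an illustration, not that
    report's mechanism):

```python
import unicodedata

# Canonical normalization (NFC) leaves the "fi" ligature U+FB01
# distinct from the two-letter sequence:
assert unicodedata.normalize("NFC", "\ufb01le") != "file"

# Compatibility normalization (NFKC) folds the ligature away...
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"

# ...but does nothing about case, so a loose identifier match still
# needs an additional folding step on top of normalization:
assert unicodedata.normalize("NFKC", "FILE").casefold() == "file"
```

    In other words, normalization answers "are these the same encoded
    text?", not "should my application treat these as equal?" -- the
    latter needs further, purpose-specific equivalences.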

    > Normalization and its applications (such as early normalization for string
    > identity matching) may indeed be the show-stopper (today), so this question
    > may be moot, but I'll ask it anyway: Are there any other uses of combining
    > classes that would break (in ways apart from normalization breaking) if the
    > assignments for the Hebrew vowels were changed? We might as well be sure
    > that we know the entire scope of the issues involved.

    Not that I know of. The reason for canonical combining classes
    in the standard is their use in canonical reordering. And the
    reason for canonical reordering is its use in normalization.
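    The chain from combining classes to reordering is easy to observe
    with two Hebrew points whose canonical combining classes differ (the
    class values are from the Unicode Character Database; the code is
    merely a demonstration):

```python
import unicodedata

patah = "\u05b7"    # HEBREW POINT PATAH,  canonical combining class 17
dagesh = "\u05bc"   # HEBREW POINT DAGESH, canonical combining class 21
assert unicodedata.combining(patah) == 17
assert unicodedata.combining(dagesh) == 21

# Canonical reordering sorts marks on a base by combining class, so
# both input orders normalize to one and the same sequence:
bet = "\u05d1"
assert (unicodedata.normalize("NFC", bet + patah + dagesh)
        == unicodedata.normalize("NFC", bet + dagesh + patah))

# The lower class (17) always sorts first after normalization:
assert unicodedata.normalize("NFD", bet + dagesh + patah) == bet + patah + dagesh
```

    This collapsing of mark order under normalization is precisely why
    the fixed-position combining classes assigned to the Hebrew vowels
    matter for the Biblical Hebrew cases discussed in this thread.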

    --Ken

    >
    > Ted
