Re: Level of Unicode support required for various languages

From: John H. Jenkins (jenkins@apple.com)
Date: Tue Oct 30 2007 - 16:37:58 CST

    I really don't want to continue this discussion because I don't think
    it's productive at this point and, frankly, my temper is fraying, but
    I'd like to make a couple of final points.

    The IRG's embedding Latin in IDSs (and yes, they do use that term) is
    wrong, not so much because it violates the formal grammar but because
    it really isn't serving the purpose the IRG intends it to serve. The
    whole reason the IRG adopted IDSs in its work was to provide a quick
    first-order way of doing unifications. Their use of Latin text is,
    basically, an admission that a particular character cannot be broken
    down into encoded parts, in which case the IDS doesn't serve any
    genuine purpose.

    The IDCs were added to Unicode because they were added to 10646, and
    they were added to 10646 ultimately because the PRC wanted them. They
    were added without sufficient attention to the technical
    ramifications of using them, which left the UTC scrambling to make
    some sort of sense of how to actually make them work. Part of that
    was restricting their scope. It turns out that the original
    restrictions were too great, and so additional uses were added.

    One of the main technical problems the IDCs presented was that there
    was no limit to the complexity of the characters potentially formed,
    making it difficult to produce systems which could even parse an IDS
    and determine where it ends. Ultimately, however, the real problem is
    the enormous difficulty of defining normalization forms and
    equivalence.
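
    To make the parsing point concrete, here is a minimal Python sketch
    of a well-formedness check. It assumes only the twelve IDCs at
    U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three operands and the
    rest two; it is an illustration of the grammar, not anyone's actual
    tooling, and note that nothing in it bounds the nesting depth.

        # Arity table for the twelve IDCs: the two "three-part" operators
        # take three operands, the rest take two.
        IDC_ARITY = {chr(cp): (3 if cp in (0x2FF2, 0x2FF3) else 2)
                     for cp in range(0x2FF0, 0x2FFC)}

        def parse_ids(text, pos=0):
            """Return the index just past one complete IDS starting at pos,
            raising ValueError if the sequence is malformed or truncated."""
            if pos >= len(text):
                raise ValueError("ran off the end of the sequence")
            ch = text[pos]
            if ch in IDC_ARITY:
                end = pos + 1
                for _ in range(IDC_ARITY[ch]):   # recurse once per operand
                    end = parse_ids(text, end)
                return end
            # Any non-IDC character is treated as a terminal component here;
            # a real checker would restrict this to ideographs, radicals, etc.
            return pos + 1

        def is_well_formed_ids(text):
            try:
                return parse_ids(text) == len(text)
            except ValueError:
                return False

        # e.g. is_well_formed_ids("\u2FF0\u5973\u5B50") checks ⿰女子 -> True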

    For example, a normalization algorithm would first have to parse an
    IDS (or whatever) for validity and then make sure that all the
    pieces in it are "spelled" properly, that is, normalize each of the
    substrings. This would likely involve a huge list of known potential
    expansions for various forms.
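
    Sketched in the same spirit, such a normalizer would need a large,
    hand-maintained table of known expansions so that an encoded
    ideograph and its spelled-out IDS compare equal; the single entry
    below is only a sample to show the mechanism, not real property
    data.

        # Hypothetical expansion table; a real one would be enormous.
        EXPANSIONS = {
            "\u597D": "\u2FF0\u5973\u5B50",   # 好 -> ⿰女子 (sample entry)
        }

        def expand(text):
            """Recursively replace every character with a known expansion."""
            out = []
            for ch in text:
                out.append(expand(EXPANSIONS[ch]) if ch in EXPANSIONS else ch)
            return "".join(out)

        def equivalent(a, b):
            # Treat two sequences as equivalent if they expand to the same
            # fully decomposed form.
            return expand(a) == expand(b)

        # equivalent("\u597D", "\u2FF0\u5973\u5B50") -> True under this table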

    These problems are IMHO inherent to any scheme which attempts to
    provide a compositional model for encoding Han. (The IDCs and IDSs
    have the further known limitation of being inadequate to provide
    acceptable rendering.) This is a conclusion I come to most
    reluctantly, since I authored (years before the IDCs were added to the
    standard) a paper urging the IRG to adopt a compositional model and
    did a fair amount of leg-work on it.

    A compositional model for Han is *very* attractive given that it
    reflects the way the script works and the way that (most) new
    characters are coined. Unfortunately, the practical problems
    involved in getting that to work are much greater than they
    initially appear to be.

    Beyond the technical problems are the political problems of getting
    such a scheme adopted in WG2 without the approval of the PRC, and
    the PRC has shown itself enormously reluctant to move away from the
    approach of separately encoding each ideograph. If nothing else, the
    PRC (and other governmental bodies in the Far East) want to
    discourage people from coining new ideographs because of the
    headaches that creates.

    After all, the current set of encodable ideographs is largely the
    fault of that very same thing -- village chiefs making up a new
    ideograph for their town's name, or proud parents making up a new
    ideograph for their kid's name, or quirky authors deliberately (or
    accidentally) creating something new on the fly, or somebody creating
    a new taboo form for someone important. Leaving this set so fully
    open is a detriment to communication, not an aid, because there's no
    authoritative way to provide data on a character other than how to
    draw it. What does it mean? How is it pronounced? Who knows? It
    turns the Han script into an infinitely large set of dingbats.

    The biggest single gain in terms of the effort involved in encoding
    ideographs would derive from shifting to variation sequences for
    variants rather than attempting to encode them all separately. The
    second biggest gain would derive from insisting on stricter standards
    for data *about* an ideograph, such as its definition, pronunciation,
    and provenance.
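
    For what it's worth, a variation sequence is nothing more than a
    base ideograph followed by a variation selector; the pairing in the
    sketch below is illustrative only, since registering actual
    sequences is the business of the Ideographic Variation Database
    (UTS #37).

        # A variant is addressed as <base, selector>, not as a new code point.
        base = "\u845B"          # 葛, chosen only as an example base character
        vs17 = "\U000E0100"      # VARIATION SELECTOR-17
        sequence = base + vs17

        print([f"U+{ord(c):04X}" for c in sequence])   # ['U+845B', 'U+E0100']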

    I'm ccing the Unicode list, even though your last message was sent
    directly to me, because I'm not actually quoting anything in that
    message.

    =====
    John H. Jenkins
    jenkins@apple.com


