Re: Unicode plane 14 language tags.

From: Doug Ewell
Date: Tue Oct 29 2002 - 12:04:26 EST

    William Overington <WOverington at ngo dot globalnet dot co dot uk> wrote:

    > I do note however that review 3 refers to a document which is only
    > available to Unicode Consortium members, which seems a strange thing
    > if views of interested individuals are being sought.

    I agree.

    > Also, it is a pity that this new era of Unicode glasnost (displayed
    > with a ligature? :-) ) comes so shortly after the last Unicode
    > Technical Committee meeting the minutes of which state the consensus
    > about no more ligatures being added to the U+FBxx block. Surely the
    > matter of ligatures would be a good topic upon which to conduct such
    > a public review.

    No, it wouldn't, and here's why:

    There is a concept in Unicode called "normalization" in which certain
    characters or sequences are considered to be equal to other characters
    or sequences for comparison purposes. Using this concept, a capital A
    plus a combining acute accent (U+0041, U+0301) can be considered
    equivalent to a precomposed A-with-acute (U+00C1). See Unicode Standard
    Annex #15 [1] for more information.
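
    To make the equivalence concrete, here is a minimal sketch, assuming
    Python's standard unicodedata module as the normalizer. The helper name
    canonically_equivalent is only illustrative, not part of any standard.

```python
import unicodedata

# Two "spellings" of the same abstract character.
decomposed = "\u0041\u0301"   # capital A + combining acute accent
precomposed = "\u00C1"        # precomposed A-with-acute

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A raw code point comparison sees two different strings...
print(decomposed == precomposed)
# ...but a comparison after normalization treats them as the same text.
print(canonically_equivalent(decomposed, precomposed))
```

    Any of the forms in UAX #15 would serve for this comparison, as long
    as both sides are put through the same one.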

    It's important to realize that the *whole reason* this mechanism exists
    is because of the precomposed ligatures and letters-with-diacritic and
    compatibility characters in Unicode. If there were only one way to
    express the concept of "A with acute" in Unicode, there would be no need
    for normalization.

    Industry standards, such as the forthcoming Internationalized Domain
    Name Architecture, depend on normalization to ensure that users don't
    get unexpected mismatches between "A plus combining acute" and
    "precomposed A-with-acute." And because these standards and their
    implementations are built to specific versions of the Unicode Standard,
    they require stability in the normalization process.

    If a new precomposed ligature "character" were added to Unicode, there
    would now be two ways of "spelling" a sequence that supposedly only had
    one spelling. Let's suppose, JUST FOR ILLUSTRATION, that Unicode added
    a "ct" ligature at U+FB07. Now there would be two ways of writing the
    sequence "ct": with the regular Latin letters (U+0063, U+0074) or with
    the ligature (U+FB07). But none of the existing normalization tables
    would equate these two, because the ct ligature did not exist in the
    (previous) version of Unicode that was used to create the normalization
    table. Thus normalization would work in some cases but not others,
    which would make the whole concept unstable and unpredictable.
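
    The mismatch can be sketched with Python's standard unicodedata
    module, on the assumption that its bundled Unicode data, like every
    published version I am aware of, leaves U+FB07 unassigned. The existing
    "fi" ligature at U+FB01 stands in for a character the tables do know
    about, and U+FB07 plays the purely hypothetical "ct" ligature. The
    name fold is only illustrative.

```python
import unicodedata

FI_LIGATURE = "\uFB01"       # existing ligature with a compatibility mapping to "fi"
HYPOTHETICAL_CT = "\uFB07"   # illustration only: assumed unassigned in this Unicode data

def fold(text: str) -> str:
    """Apply compatibility normalization (NFKC) using the tables this Python ships with."""
    return unicodedata.normalize("NFKC", text)

# The tables know U+FB01, so it folds to the plain letters.
print(fold(FI_LIGATURE) == "fi")

# The tables have no entry for U+FB07, so it passes through unchanged
# and never compares equal to the plain letters "ct".
print(fold(HYPOTHETICAL_CT) == "ct")

# General category of the code point as seen by these tables.
print(unicodedata.category(HYPOTHETICAL_CT))
```

    An implementation built against older tables is in exactly this
    position with respect to any precomposed character added later.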

    That is why Unicode and WG2 have a policy [2] against adding new
    precomposed ligatures and letters-with-diacritic, to the U+FBxx block or
    anywhere else. They would break the stability of normalization, a
    concept whose entire value lies in its stability. That is why the "ct"
    ligature will not be added at U+FB07, and that is also why the National
    Taitung Teachers College will not see their 42 precomposed Latin letters
    added to Unicode. It is a good, sensible, well-thought-out policy that
    will not benefit from public review.

    Now, the Plane 14 language tag characters are a different matter
    entirely. There the UTC proposes not to add something in violation of
    its existing policy, but to formally discourage something that was
    added only a couple of years ago. I am actually arguing for *greater*
    stability in the Unicode Standard, by arguing against the process of
    adding and then immediately deprecating features like language tags.
    (That is not my only argument for Plane 14, but it is one.)

    -Doug Ewell
     Fullerton, California

