Re: Conflicting principles

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 06 2003 - 18:18:46 EDT


    John Cowan asked:

    > I would like to ask the old farts^W^Wrespected elders of the UTC
    > which principle they consider more important, abstractly speaking:
    > the principle that combining marks always follow their base characters
    > (a typographical principle), or that text is stored, with a few minor
    > exceptions, in phonetic order (a lexicographical principle).

    As may often be the case in such hypothetical questions, I
    think there is a false dichotomy presumed here.

    The principle of the order of combining marks results from the
    need to resolve the following architectural question for the
    standard:

       Does a combining mark apply to the base character that
       precedes it or to the base character that follows it?
       
       In other words, does á = <0061, 0301> or does á = <0301, 0061>?
       
    There can only be one right answer to that question, while having
    a coherent, interoperable character encoding standard.

    The choice that the Unicode architects made on this principle in
    1989 is sacrosanct and inviolable.
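
    The effect of that choice can be observed through normalization.
    What follows is only a minimal sketch, assuming Python 3 and its
    standard unicodedata module; the variable names are illustrative.
    It shows that a mark composes with the base that precedes it, and
    not with one that follows it.

```python
import unicodedata

base, mark = "\u0061", "\u0301"  # LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

after = base + mark    # mark follows its base: the order Unicode chose
before = mark + base   # mark precedes the base: the rejected order

# NFC composes <0061, 0301> into the precomposed U+00E1.
composed = unicodedata.normalize("NFC", after)
print([f"{ord(c):04X}" for c in composed])   # ['00E1']

# NFC leaves <0301, 0061> untouched: the mark has no preceding base here,
# and it does not apply to the letter that follows it.
unchanged = unicodedata.normalize("NFC", before)
print([f"{ord(c):04X}" for c in unchanged])  # ['0301', '0061']
```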

    The principle of logical order of encoding results from the
    need to resolve the following architectural question for the
    standard:

       Is a right-to-left script encoded in visual order in
       the backing store or in phonetic (= logical) order?
       
       In other words, is "tsava" spelled <05E6, 05D1, 05D0> or
       <05D0, 05D1, 05E6>?
       
    There can only be one right answer to that question, while having
    a coherent, interoperable character encoding standard.

    The choice that the Unicode architects made on this principle in
    1989 is sacrosanct and inviolable.
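
    As a rough illustration, again assuming Python 3 and the standard
    unicodedata module, the logical-order spelling of "tsava" can be
    inspected directly. The reversed string at the end is only a
    simplification of what a left-to-right scan of the displayed word
    would see for this short run of right-to-left letters; it is not
    an implementation of the bidirectional algorithm.

```python
import unicodedata

# "tsava" in logical (reading) order: tsadi first, alef last.
tsava = "\u05E6\u05D1\u05D0"

names = [unicodedata.name(c) for c in tsava]
print(names)
# ['HEBREW LETTER TSADI', 'HEBREW LETTER BET', 'HEBREW LETTER ALEF']

# Each letter carries the strong right-to-left bidi class "R", which is
# what lets a renderer lay the run out right to left at display time.
bidi = [unicodedata.bidirectional(c) for c in tsava]
print(bidi)  # ['R', 'R', 'R']

# Simplified visual order as scanned left to right on the page.
# This is NOT what gets stored in the backing store.
visual_ltr = tsava[::-1]
print([f"{ord(c):04X}" for c in visual_ltr])  # ['05D0', '05D1', '05E6']
```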

    Everything else is just working out the details for making actual
    script encodings consistent in the context of those overarching
    principles. The status of a character as combining or not is
    up for grabs, depending on the analysis of a script's behavior
    and how it should be represented. And the layout for actual
    display of rendered texts does not, and never has, slavishly
    followed logical order in lockstep.

    Again, everyone, if you haven't already, go back and meditate
    some more on the fundamental mandala of Unicode: Figure 2-3,
    Unicode Character Code to Rendered Glyphs, which illustrates
    both issues of combining mark order with respect to base
    character and general logical order of characters as applied
    to a particular script encoding (Devanagari).
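
    One small instance of the kind of case that figure covers,
    sketched under the assumption of Python 3 and its standard
    unicodedata module: the Devanagari syllable "hi" is stored as the
    consonant followed by the vowel sign, even though the vowel sign
    is drawn to the left of the consonant when rendered. The choice
    of this particular syllable is illustrative and is not taken from
    the figure itself.

```python
import unicodedata

# Stored order: consonant HA, then dependent VOWEL SIGN I.
hi = "\u0939\u093F"

info = [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
        for c in hi]
for row in info:
    print(row)
# ('U+0939', 'DEVANAGARI LETTER HA', 'Lo')
# ('U+093F', 'DEVANAGARI VOWEL SIGN I', 'Mc')

# The vowel sign is a spacing combining mark (Mc) that follows its base
# in the backing store; placing its glyph to the left of the consonant
# is the renderer's job, not the encoding's.
```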

    And don't miss the following piece of text associated with that
    figure:

      "The Unicode Standard documents the default relationship
       between character sequences and glyphic appearance for the
       purpose of ensuring that the same text content can be
       stored with the same, and therefore interchangeable,
       sequence of character codes."
       
    This should, IMO, be put up on a pedestal and have the spotlights
    shined on it. This is the *fundamental* obligation of a character
    encoding standard. If you cannot accomplish this, then you just
    have a bunch of charts full of pretty pictures, and everyone is
    on their own for trying to figure out how to communicate with
    anybody else using them.

    > As someone or other said, "I believe that hitherto -- *hitherto,* mark
    > you -- [we have] entirely overlooked the existence of", well, scripts
    > that might cause a conflict between these esteemed principles.

    The reason why the UTC should tackle the encoding of Tengwar
    is not so much because it would help in the publication of Elvish
    poetry, but because confronting the architectural issues
    it poses for encoding would make an excellent tutorial case
    for how the two principles of combining mark order and
    logical order impact the task of coming up with an appropriate
    encoding for a complex script. And it would starkly illustrate
    the fact that an appropriate character encoding does not
    necessarily directly reflect the phonological structure of
    a language as represented by that script.

    --Ken


