Fwd: Re: PRC asking for 956 precomposed Tibetan characters

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Tue Jan 07 2003 - 05:50:04 EST

    ------- Start of forwarded message -------

    From: "Robert R. Chilton" <acip@well.com>
    Date: Tue, 07 Jan 2003 00:20:01 -0500
    Cc: unicode@unicode.org, tibex@unicode.org
    Subject: Re: PRC asking for 956 precomposed Tibetan characters
    To: "Andrew C. West" <andrewcwest@alumni.princeton.edu>

    Andrew C. West wrote:
    > On Mon, 06 Jan 2003 01:46:44 -0800 (PST), "Robert R. Chilton" wrote:
    > ...
    > > Such cases of triple (or quadruple) vowels E or O are best normalized to
    > > double vowel plus single (or double) vowel to aid in collation and other
    > > character data processing functions. Thus, Glyph 107 is best encoded as
    > > (or normalized to) <U+0F41, U+0FB1, U+0F7B, U+0F7A>.
    > >
    > My rationale for not normalising to double vowel plus single (or double) vowel
    > is that a double vowel sign used to indicate a shorthand abbreviation is
    > fundamentally different from a double vowel used to represent a long vowel. For
    > instance, when the phrase "ki ki swo swo" is abbreviated to "Ka + double I" and
    > "Swa + double O" the double I and double O vowels represent the contraction of
    > two I syllables and O syllables respectively, and not a long I and long O vowel
    > respectively. As there is no character for a double I vowel sign, then the
    > double I vowel must needs be encoded as two consecutive I vowels. Although there
    > is a double O vowel sign (U+0F7D), I think that encoding it in the same manner
    > as the double I, as two consecutive O vowels, would be more consistent than
    > encoding it with the graphically identical but semantically different double O
    > vowel. By encoding it as two consecutive O vowels it is making an explicit
    > statement that this is a shorthand abbreviation and not simply a long O.
    > As to shorthand abbreviations with three or four identical vowel signs, what is
    > the advantage of normalising to "vowel + double vowel" or "double vowel +
    > vowel" other than saving a few bytes ? I don't see how this would aid collation
    > or other character data processing functions. Given that KHYA + triple E could
    > legitimately be encoded as <U+0F41, U+0FB1, U+0F7B, U+0F7A>, <U+0F41, U+0FB1,
    > U+0F7A, U+0F7B> or <U+0F41, U+0FB1, U+0F7A, U+0F7A, U+0F7A>, a good Tibetan
    > font would have to map all three sequences to the same glyph. And from a collation
    > point of view, why is any one of these sequences more helpful than another ? All
    > three sequences would be collated after <U+0F41, U+0FB1, U+0F7A>. Admittedly
    > only <U+0F41, U+0FB1, U+0F7B, U+0F7A> might be collated after <U+0F41, U+0FB1,
    > U+0F7B>, but then as KHYEEE probably represents an abbreviation for KHYE KHYE
    > KHYE, should it not be collated after KHYE rather than KHYEE ?
    > In short, I believe that it is useful to encode shorthand abbreviations as a
    > sequence of individual vowels so as to distinguish them from graphically
    > identical long vowel syllables, and to make explicit their function as
    > abbreviations.
    > Nevertheless, I'm not terribly fussed about this, and am happy to follow the
    > consensus of opinion.
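
The three candidate encodings of KHYA + triple E discussed above can be compared directly. What follows is a minimal Python sketch, not part of the original message. It relies on the standard `unicodedata` module, and it uses plain code point order as a stand-in for collation, which is a simplifying assumption (a real collation algorithm may order these strings differently). All variable names are illustrative.

```python
import unicodedata

KHYA = "\u0F41\u0FB1"        # TIBETAN LETTER KHA + TIBETAN SUBJOINED LETTER YA
E, EE = "\u0F7A", "\u0F7B"   # TIBETAN VOWEL SIGN E, TIBETAN VOWEL SIGN EE

khye = KHYA + E              # single vowel E
khyee = KHYA + EE            # double vowel EE
triple_e = {
    "EE+E": KHYA + EE + E,
    "E+EE": KHYA + E + EE,
    "E+E+E": KHYA + E + E + E,
}

# Apply Normalization Form D to each candidate and count the distinct results.
nfd_forms = {unicodedata.normalize("NFD", s) for s in triple_e.values()}
print("distinct NFD forms:", len(nfd_forms))

# Assumption: code point order approximates collation for this comparison.
labels = {khye: "KHYE", khyee: "KHYEE"}
labels.update({s: "KHYEEE as " + name for name, s in triple_e.items()})
ordered = sorted(labels)
for s in ordered:
    print(labels[s])
```

If the three sequences come out as three distinct NFD forms, Unicode normalization alone does not unify them, so any unification would have to be a separate, application-level convention.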

    I understand your interest in preserving the semantic or lexical
    distinction between an instance of a contracted series of single vowels
    and a true usage of the double vowel. However, the procedure of
    normalization is designed to collapse all the variant encodings for a
    particular presentation form into a single, "normalized" encoding. Take
    for example the Sanskrit vowel long-r which is romanized as a letter r
    with a macron over and a dot under. Without normalization this
    presentation form could be encoded either as r+macron+dot-under or as
    r+dot-under+macron. A problem comes in data processing (searching,
    sorting, etc.) in that what appears on screen (the presentation form)
    is identical for both encodings yet a search for "r+macron+dot-under"
    will not find instances of "r+dot-under+macron".
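
The search problem described above can be reproduced in a few lines. This is a minimal Python sketch using the standard `unicodedata` module; the helper names `nfd` and `normalized_find` and the sample string are illustrative assumptions, not anything from the original discussion.

```python
import unicodedata

MACRON = "\u0304"      # COMBINING MACRON
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

macron_first = "r" + MACRON + DOT_BELOW   # r+macron+dot-under
dot_first = "r" + DOT_BELOW + MACRON      # r+dot-under+macron

def nfd(text):
    """Return the Normalization Form D of text."""
    return unicodedata.normalize("NFD", text)

def normalized_find(haystack, needle):
    """Search after normalizing both strings to NFD."""
    return nfd(haystack).find(nfd(needle))

# Hypothetical sample text containing the long-r in one of the two encodings.
text = "pit" + dot_first + "n"

print(macron_first == dot_first)            # raw code point sequences differ
print(nfd(macron_first) == nfd(dot_first))  # normalized sequences agree
print(text.find(macron_first))              # naive search
print(normalized_find(text, macron_first))  # search on normalized text
```

Normalizing both the text and the search key before comparing is one common way to make the two encodings match; it is shown here only as an illustration of the point being made.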

    Canonical combining classes are defined for combining characters (such
    as macron and dot-under, or the vowel signs of Tibetan) in order to
    support normalization of identical presentation forms to a single
    encoding. So in the cases you cite, of "graphically identical but
    semantically different" instances, consistency in searching, sorting,
    etc. requires that all "graphically identical" presentation forms be
    normalized to a single normalized encoding.
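
To make the role of canonical combining classes concrete, the sketch below looks up the class of several Tibetan vowel signs and shows one reordering. It is an illustrative Python example using the standard `unicodedata` module; the values reported depend on the Unicode data bundled with the Python interpreter, and the variable names are assumptions of this sketch.

```python
import unicodedata

# A selection of Tibetan vowel signs (AA, I, U, E, EE, O, OO).
MARKS = ["\u0F71", "\u0F72", "\u0F74", "\u0F7A", "\u0F7B", "\u0F7C", "\u0F7D"]

# Map each code point label to its canonical combining class.
combining_classes = {
    f"U+{ord(ch):04X}": unicodedata.combining(ch) for ch in MARKS
}
for ch in MARKS:
    label = f"U+{ord(ch):04X}"
    print(label, unicodedata.name(ch), "ccc =", combining_classes[label])

# Marks with different non-zero classes are sorted into ascending class order.
u_then_aa = "\u0F67\u0F74\u0F71"   # HA + VOWEL SIGN U + VOWEL SIGN AA
reordered = unicodedata.normalize("NFD", u_then_aa)
print(" ".join(f"U+{ord(c):04X}" for c in reordered))
```

Marks that share the same class, such as the E and EE vowel signs, keep their relative order under normalization.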

    At the risk of adding further confusion, perhaps it is useful to mention
    at this point that there are two errors in the assignment of canonical
    combining class to characters in the Tibetan block: TIBETAN SIGN RJES
    SU NGA RO [U+0F7E] and TIBETAN MARK HALANTA [U+0F84]. These two
    characters should have been assigned a high enough combining class that
    will cause them to be normalized to a position following any vowel

    The erroneous combining class of 0 (zero) assigned to TIBETAN SIGN RJES
    SU NGA RO [U+0F7E] is particularly troublesome since RJES SU NGA RO
    [U+0F7E] is closely related to, and in some cases interchangeable with,
    [U+0F83]--these latter two being assigned a (correct) combining class of

    As demonstrated in the table below, although the various instances of
    Tibetan syllable HUUNG [H'Um according to ACIP romanization] written
    using the TIBETAN SIGN SNA LDAN will normalize to a single sequence, the
    same cannot be said for the various instances of syllable HUUNG written
    using the TIBETAN SIGN RJES SU NGA RO:

    Variant encodings of HUUNG       Normalization Form D             Status
    <U+0F67,U+0F71,U+0F74,U+0F83>    <U+0F67,U+0F71,U+0F74,U+0F83>    OK
    <U+0F67,U+0F83,U+0F75>           <U+0F67,U+0F71,U+0F74,U+0F83>    OK
    <U+0F67,U+0F83,U+0F74,U+0F71>    <U+0F67,U+0F71,U+0F74,U+0F83>    OK

    <U+0F67,U+0F71,U+0F74,U+0F7E>    <U+0F67,U+0F71,U+0F74,U+0F7E>    OK
    <U+0F67,U+0F7E,U+0F75>           <U+0F67,U+0F7E,U+0F71,U+0F74>    PROBLEM
    <U+0F67,U+0F7E,U+0F74,U+0F71>    <U+0F67,U+0F7E,U+0F71,U+0F74>    PROBLEM

    [Please refer to Unicode Technical Report #15: Unicode Normalization
    Forms for more information on normalization.]
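
The rows of the table can be checked mechanically. The following is a minimal Python sketch using the standard `unicodedata` module; the helper names `seq`, `fmt` and `nfd_forms` are illustrative, and the outcome reflects the Unicode data shipped with the Python interpreter in use.

```python
import unicodedata

def seq(*code_points):
    """Build a string from integer code points."""
    return "".join(chr(cp) for cp in code_points)

def fmt(text):
    """Render a string in the <U+XXXX,...> notation used in the table."""
    return "<" + ",".join(f"U+{ord(c):04X}" for c in text) + ">"

# Variant encodings of HUUNG written with SNA LDAN (U+0F83).
SNA_LDAN = [
    seq(0x0F67, 0x0F71, 0x0F74, 0x0F83),
    seq(0x0F67, 0x0F83, 0x0F75),
    seq(0x0F67, 0x0F83, 0x0F74, 0x0F71),
]
# The same variants written with RJES SU NGA RO (U+0F7E).
RJES_SU_NGA_RO = [
    seq(0x0F67, 0x0F71, 0x0F74, 0x0F7E),
    seq(0x0F67, 0x0F7E, 0x0F75),
    seq(0x0F67, 0x0F7E, 0x0F74, 0x0F71),
]

def nfd_forms(variants):
    """Return the set of distinct NFD results for a list of variants."""
    return {unicodedata.normalize("NFD", v) for v in variants}

for group in (SNA_LDAN, RJES_SU_NGA_RO):
    for variant in group:
        print(fmt(variant), "->", fmt(unicodedata.normalize("NFD", variant)))
    print("distinct normalized forms:", len(nfd_forms(group)))
```

A group whose variants all normalize to one form corresponds to the "OK" rows; a group that yields more than one form corresponds to the "PROBLEM" rows.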

    I hope this is helpful.

    Kind regards,
    Robert Chilton

    ------- End of forwarded message -------
