Re: PRC asking for 956 precomposed Tibetan characters

From: Andrew C. West (
Date: Mon Jan 06 2003 - 06:45:18 EST

    On Mon, 06 Jan 2003 01:46:44 -0800 (PST), "Robert R. Chilton" wrote:

    > Moreover, for the authors of n2558 to argue that a non-combining model
    > of Tibetan is necessary for compatibility with "traditional education,
    > publication and electronic desktop publishing systems" is to entirely
    > discount the use of other complex scripts --such as the Indic scripts
    > which employ a combining model-- in such "systems". Clearly, the
    > direction of such a rationale runs entirely opposite to the basic
    > principles of Unicode/ISO-10646.

    Exactly. And as the underlying encoding should be opaque to the end user, it
    should make no difference to someone entering Tibetan text into an electronic
    desktop publishing system whether the system is encoding the syllable "rgya" as
    one character or three.
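
    As an illustration of the "three characters" case, here is a minimal Python
    sketch. It assumes that "rgya" in the combining model is base RA, subjoined GA
    and subjoined YA; that mapping and the helper name describe() are illustrative
    assumptions, not something stated in n2558 or in this thread.

```python
import unicodedata

# Assumed combining-model encoding of the syllable "rgya":
# base RA, subjoined GA, subjoined YA (three code points, one visual stack).
RGYA = "\u0f62\u0f92\u0fb1"

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(RGYA):
    print(code_point, name)

# The stored length is three, whatever the user sees on screen.
print(len(RGYA))
```

    Whether the stack is stored as one code point or three is a matter for the
    input method and the font, which is why the user should not need to care.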

    > Such cases of triple (or quadruple) vowels E or O are best normalized to
    > double vowel plus single (or double) vowel to aid in collation and other
    > character data processing functions. Thus, Glyph 107 is best encoded as
    > (or normalized to) <U+0F41, U+0FB1, U+0F7B, U+0F7A>.

    My rationale for not normalising to double vowel plus single (or double) vowel
    is that a double vowel sign used to indicate a shorthand abbreviation is
    fundamentally different from a double vowel used to represent a long vowel. For
    instance, when the phrase "ki ki swo swo" is abbreviated to "Ka + double I" and
    "Swa + double O" the double I and double O vowels represent the contraction of
    two I syllables and O syllables respectively, and not a long I and long O vowel
    respectively. As there is no character for a double I vowel sign, then the
    double I vowel must needs be encoded as two consecutive I vowels. Although there
    is a double O vowel sign (U+0F7D), I think that encoding it in the same manner
    as the double I, as two consecutive O vowels, would be more consistent than
    encoding it with the graphically identical but semantically different double O
    vowel. Encoding it as two consecutive O vowels makes an explicit
    statement that this is a shorthand abbreviation and not simply a long O.

    As to shorthand abbreviations with three or four identical vowel signs, what is
    the advantage of normalising to "vowel + double vowel" or "double vowel + double
    vowel" other than saving a few bytes? I don't see how this would aid collation
    or other character data processing functions. Given that KHYA + triple E could
    legitimately be encoded as <U+0F41, U+0FB1, U+0F7B, U+0F7A>, <U+0F41, U+0FB1,
    U+0F7A, U+0F7B> or <U+0F41, U+0FB1, U+0F7A, U+0F7A, U+0F7A>, a good Tibetan font
    would have to map all three sequences to the same glyph. And from a collation
    point of view, why is any one of these sequences more helpful than another? All
    three sequences would be collated after <U+0F41, U+0FB1, U+0F7A>. Admittedly
    only <U+0F41, U+0FB1, U+0F7B, U+0F7A> might be collated after <U+0F41, U+0FB1,
    U+0F7B>, but then as KHYEEE probably represents an abbreviation for KHYE KHYE
    KHYE, should it not be collated after KHYE rather than KHYEE?

    In short, I believe that it is useful to encode shorthand abbreviations as a
    sequence of individual vowels so as to distinguish them from graphically
    identical long vowel syllables, and to make explicit their function as shorthand
    abbreviations.

    Nevertheless, I'm not terribly fussed about this, and am happy to follow the
    consensus of opinion.
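
    The comparison of the three KHYA + triple E sequences can be checked with a
    short Python sketch. Two things here are assumptions for illustration only:
    the helper expand_double_vowels() is a hypothetical folding step (it is not a
    Unicode normalisation form), and plain code point order stands in for
    collation, which a real implementation would tailor.

```python
KHYA = "\u0f41\u0fb1"        # KHA + subjoined YA
E, EE = "\u0f7a", "\u0f7b"   # vowel sign E, vowel sign EE
O, OO = "\u0f7c", "\u0f7d"   # vowel sign O, vowel sign OO

# The three candidate encodings of KHYA + triple E.
SEQUENCES = {
    "EE+E": KHYA + EE + E,
    "E+EE": KHYA + E + EE,
    "E+E+E": KHYA + E + E + E,
}

def expand_double_vowels(text):
    """Hypothetical folding: rewrite each double vowel sign as two single signs."""
    return text.replace(EE, E + E).replace(OO, O + O)

KHYE = KHYA + E
KHYEE = KHYA + EE

# All three sequences fold to one form, so a font could treat them alike.
folded = {expand_double_vowels(seq) for seq in SEQUENCES.values()}
print(len(folded))

# Naive code point comparison: does each sequence sort after KHYE / KHYEE?
for label, seq in SEQUENCES.items():
    print(label, seq > KHYE, seq > KHYEE)
```

    Under these assumptions every sequence sorts after KHYE, but only the one that
    starts with the double E sorts after KHYEE, which matches the observation above.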

    > Assuming that there have been no changes in the combining classes of
    > these characters since Unicode 3.0, the 2 characters <U+0F88> and
    > <U+0F89> are spacing, non-combining characters. Therefore, the only
    > possible encoding that will place the "base consonant" under these signs
    > (i.e., will result in these signs being "superfixed" to the letters KA,
    > KHA, PA, PHA, et al.) is for these characters to appear in the data
    > stream just prior to the "base consonant", such base consonant being
    > encoded in subjoined position. [It is not really correct to say that
    > "The Unicode Standard does not explicitly specify the coding sequence
    > for letters that are combined with any of the transliteration characters
    > U+0F88 through U+0F8B" since the combining class of the characters is
    > determinative.]
    > Thus, to encode Glyphs 029 and 100 use <U+0F88, U+0F90> and <U+0F88,
    > U+0F91>, respectively. Likewise, to encode Glyphs 435 and 486 use
    > <U+0F89, U+0FA4> and <U+0F89, U+0FA5>, respectively.

    Thanks for the explanation. I'm afraid my understanding of combining characters
    is rather hazy. I was mistakenly assuming that U+0F88 and U+0F89 were combining
    characters, and therefore encoding them after the base consonant in the same way
    that combining u-umlaut is encoded as <U+0075, U+0308>.

    I actually came up with the sequence <U+0F88, U+0F90> on my first attempt to
    encode Glyph 29, but I decided it must be wrong as I thought that a stack ought
    to have a base consonant to be valid. If what you are suggesting is that the
    characters U+0F88 through U+0F8B can behave as base consonants, then I guess I
    was right the first time. (Looking back at the Unicode Standard, I notice it
    states that a stack contains "at most one base consonant" and "any number of
    subjoined consonants", so a stack with no base consonant would be valid).
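
    The difference between the two cases can be inspected with Python's
    unicodedata module. This is only a sketch: the helper names
    is_combining_mark() and sign_leads_stack() are hypothetical, and the check
    assumes that subjoined consonants are classed as marks in the character
    database, as they are in current versions of the Unicode data.

```python
import unicodedata

def is_combining_mark(ch):
    """True if the character's general category is a mark (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")

# A true combining mark follows its base: u + combining diaeresis composes.
U_UMLAUT = unicodedata.normalize("NFC", "\u0075\u0308")

# Sequences suggested in the quoted message, keyed by glyph number.
GLYPHS = {
    "029": "\u0f88\u0f90",
    "100": "\u0f88\u0f91",
    "435": "\u0f89\u0fa4",
    "486": "\u0f89\u0fa5",
}

def sign_leads_stack(seq):
    """Check that a non-combining sign comes first and only marks follow it."""
    return (not is_combining_mark(seq[0])
            and all(is_combining_mark(ch) for ch in seq[1:]))

for number, seq in GLYPHS.items():
    print(number, [f"U+{ord(ch):04X}" for ch in seq], sign_leads_stack(seq))
```

    In each of the four sequences the sign at U+0F88 or U+0F89 is not a mark, so it
    has to come first, with the subjoined consonant following it.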

    > Note that these
    > latter two glyphs are *NOT* a case of superfixed TIBETAN MARK PALUTA but
    > rather a case of superfixed TIBETAN SIGN MCHU CAN. The PALUTA has a
    > different function (of transliterating the Sanskrit apostrophe in
    > Tibetan script) and is not found in superfixed position. [Note also
    > that a naive reader might mistake the TIBETAN SIGN MCHU CAN for a
    > superfixed NYA, just as one might confuse the NYA and the PALUTA.]

    Thanks for the correction. I'm afraid I've never seen a Paluta in action, and
    naively assumed that this was what the superjoined sign was. Nor, I'm afraid, am I
    familiar with the signs at U+0F88 through U+0F8B.

    > Though I confess that I am not familiar with these orthographies, the
    > glyphs cited are cases of TIBETAN MARK TSA -PHRU [U+0F39] being affixed
    > to letters ZHA, ZA, and -A, respectively. They would be encoded as
    > <U+0F5E, U+0F39>, <U+0F5F, U+0F39> and <U+0F60, U+0F39>.

    I did wonder whether the mark was a TSA -PHRU, but in the document it looks
    dot-like rather than flag-like - perhaps at higher resolution it would be
    clearer. However, I still wonder what the TSA -PHRU signifies when added to
    these letters.
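
    For reference, the three suggested sequences can be built and inspected as
    follows. This is a sketch only; the dictionary keys and the helper
    with_tsa_phru() are illustrative names, and nothing here says what the mark
    signifies on these letters.

```python
import unicodedata

TSA_PHRU = "\u0f39"  # TIBETAN MARK TSA -PHRU

# Base letters named in the quoted message.
BASES = {"ZHA": "\u0f5e", "ZA": "\u0f5f", "-A": "\u0f60"}

def with_tsa_phru(base):
    """Append the TSA -PHRU mark after a base letter."""
    return base + TSA_PHRU

SEQUENCES = {name: with_tsa_phru(base) for name, base in BASES.items()}

# Unlike U+0F88 and U+0F89, this mark is a combining character,
# so it is encoded after the letter it attaches to.
print(unicodedata.category(TSA_PHRU), unicodedata.combining(TSA_PHRU))

for name, seq in SEQUENCES.items():
    print(name, [f"U+{ord(ch):04X}" for ch in seq])
```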

    > I hope this is useful.

    Very useful indeed. I'll update my web pages to reflect your comments as soon as

