Re: PRC asking for 956 precomposed Tibetan characters

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu Jan 09 2003 - 08:15:57 EST

    I'm forwarding this off-line email from Robert as I think it raises some
    important issues about Tibetan encoding.

    ------- Start of forwarded message -------

    From: "Robert R. Chilton" <acip@well.com>
    Date: Wed, 08 Jan 2003 23:29:13 -0500
    Cc: cfynn@gmx.net
    Subject: Re: PRC asking for 956 precomposed Tibetan characters
    To: "Andrew C. West" <andrewcwest@alumni.princeton.edu>

    Andrew,

    It should be mentioned that there are different normalized forms; I've
    been referring, more or less, to "Normalization Form D", which is the
    form needed by processes that do searching and sorting. Normalization
    Form D is essentially (at least for Tibetan) the maximum decomposition
    of characters.

    > 1. I've encoded glyphs with subjoined HA as the precomposed characters U+0F43
    > etc. rather than decomposing them to U+0F42, U+0FB7 etc. Is this the correct
    > normalized form ?

    No. Normalization Form D applies canonical decomposition (indicated by
    the three-bar equal sign in the code chart) to characters. Thus, all
    these precomposed characters with subjoined HA need to be decomposed.
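
    As a rough illustration of the point above, the following sketch
    uses Python's standard unicodedata module to apply Normalization
    Form D to the precomposed letters with subjoined HA. The list of
    code points and the helper name to_codepoints are choices made for
    this example only.

```python
import unicodedata

# Precomposed Tibetan letters with subjoined HA (GHA, DDHA, DHA, BHA, DZHA).
# Each one carries a canonical decomposition in the Unicode Character Database.
SUBJOINED_HA_LETTERS = ["\u0F43", "\u0F4D", "\u0F52", "\u0F57", "\u0F5C"]


def to_codepoints(text):
    """Render a string as a list of U+XXXX labels for easy inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


for letter in SUBJOINED_HA_LETTERS:
    decomposed = unicodedata.normalize("NFD", letter)
    print(to_codepoints(letter), "->", to_codepoints(decomposed))
```

    Each line printed should show a single precomposed code point on
    the left and a base letter followed by U+0FB7 on the right.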

    > 2. The use of the precomposed long vowels with a-chung, U+0F73, U+0F75, U+0F77
    > and U+0F79 is "discouraged" or "strongly discouraged" in the Unicode code
    > charts, and so I have decomposed them to U+0F72, U+0F71 etc. Is this correct ?

    Yes.
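
    A small sketch of what this looks like in practice, again assuming
    Python's unicodedata module (the helper name nfd_codepoints is
    invented for the example). It also shows that canonical ordering
    puts the a-chung U+0F71 before the vowel sign, whichever order the
    two marks were entered in.

```python
import unicodedata


def nfd_codepoints(text):
    """Return the NFD form of text as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# Precomposed long vowels II and UU.
print(nfd_codepoints("\u0F73"))
print(nfd_codepoints("\u0F75"))

# Vowel sign I entered before a-chung: NFD reorders the two marks by
# canonical combining class, so the result matches the NFD form of U+0F73.
print(nfd_codepoints("\u0F72\u0F71"))
```

    Because both spellings end up as the same sequence, a search or
    sort that normalizes its input first treats them as identical.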

    > 3. I've decomposed U+0F76 and U+0F78 to U+0FB2, U+0F80 and U+0FB3, U+0F80
    > respectively. I'm not at all sure that it is correct to decompose these
    > characters - what is your opinion ? And if I should not decompose U+0F76 and
    > U+0F78, then should U+0F77 be decomposed to U+0F76, U+0F71, and U+0F79
    > decomposed to U+0F78, U+0F71, even though no such equivalence is given in the
    > Unicode code charts.

    I think for the present purpose it is wise to decompose *all*
    precomposed characters to their maximum decomposition. In this regard I
    will contradict my earlier position regarding the triple (and double)
    vowels E and O. The reason for my vacillation on this point is that
    there is no canonical decomposition specified in the Unicode standard
    for these two characters (U+0F7B and U+0F7D). Upon reflection, however,
    I believe that a mistake was made in this regard and that these two
    characters should have been recognized (for data processing purposes) as
    *precomposed* characters and, further, that they should have been
    deprecated with canonical decomposition to <U+0F7A, U+0F7A> and <U+0F7C,
    U+0F7C>.
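
    The difference in status between these vowel signs can be checked
    against the character database. The sketch below assumes Python's
    unicodedata module; the function name classify and its three labels
    are made up for this illustration.

```python
import unicodedata


def classify(ch):
    """Report what kind of decomposition mapping the database gives for ch."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        # A tag such as <compat> marks a compatibility mapping, which
        # Normalization Form D does not apply.
        return "compatibility"
    return "canonical"


for cp in (0x0F76, 0x0F77, 0x0F78, 0x0F79, 0x0F7B, 0x0F7D):
    ch = chr(cp)
    print(f"U+{cp:04X}", classify(ch), unicodedata.decomposition(ch))
```

    Under this check U+0F7B and U+0F7D report no mapping at all, so no
    standard normalization form will split them.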

    Unless I hear a good argument to the contrary, I will modify my own
    collation tables and other materials so as to treat U+0F7B and U+0F7D as
    precomposed characters that should be decomposed.
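
    One possible way to carry this out in a preprocessing step is
    sketched below. It is only an illustration: the table
    EXTRA_DECOMPOSITIONS and the function max_decompose are
    hypothetical names, and the mappings for U+0F7B and U+0F7D are the
    ones proposed above, not mappings defined by the Unicode standard.
    The entries for U+0F77 and U+0F79 spell out their compatibility
    decompositions so that full NFKD (which would also touch unrelated
    characters) is not needed.

```python
import unicodedata

# Hypothetical extra mappings applied on top of NFD. The first two follow
# the proposal in this message and are NOT part of the Unicode standard.
EXTRA_DECOMPOSITIONS = {
    "\u0F7B": "\u0F7A\u0F7A",        # EE -> E, E (proposed)
    "\u0F7D": "\u0F7C\u0F7C",        # OO -> O, O (proposed)
    "\u0F77": "\u0FB2\u0F71\u0F80",  # expanded compatibility mapping
    "\u0F79": "\u0FB3\u0F71\u0F80",  # expanded compatibility mapping
}
_TABLE = str.maketrans(EXTRA_DECOMPOSITIONS)


def max_decompose(text):
    """Apply the extra mappings, then canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text.translate(_TABLE))


sample = "\u0F40\u0F7B"  # KA + vowel sign EE
print([f"U+{ord(ch):04X}" for ch in max_decompose(sample)])
```

    Running the extra mappings before NFD lets the normalizer handle
    the canonical ordering of whatever marks the table produces.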

    As I see it (and this applies also to double vowels E and O), the only
    purpose for *any* of the precomposed characters in the Tibetan block is
    to facilitate using the Tibetan script for representing and processing a
    language other than Tibetan (e.g., Sanskrit). Here I am speaking with
    regard to the level of encoding and not of glyph rendering /
    presentation forms! Also, I am not speaking here of Indic
    transliteration orthographies that are found in abundance in Tibetan
    materials but rather of usages where the material in question is clearly
    Indic and follows Indic rules of collation, etc. (Prime examples would
    be: a Sanskrit-Tibetan dictionary written in Tibetan script but sorted
    according to Sanskrit collation order or a full-text Sanskrit document
    written out in Tibetan letters.)

    Thus, for virtually all intents and purposes, *none* of the precomposed
    Tibetan characters should be used (including U+0F00, U+0F7B and
    U+0F7D). On a slightly different subject, processes should be on guard
    against use of, e.g., U+0F39 with U+0F45, et al., since such usage
    would result in various data processing problems, including apparently
    incorrect searching and sorting.
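
    A process could guard against both problems with a simple scan of
    the incoming text. The sketch below is an assumption-laden
    example: find_problems, PRECOMPOSED and SUSPECT_BASES are
    hypothetical names, the precomposed set lists only code points
    mentioned in this thread, and the suspect-base set holds only
    U+0F45 and would need to be extended for real use.

```python
import unicodedata

# Precomposed code points named in this thread (extend as needed).
PRECOMPOSED = {
    "\u0F00", "\u0F43", "\u0F73", "\u0F75", "\u0F76",
    "\u0F77", "\u0F78", "\u0F79", "\u0F7B", "\u0F7D",
}
TSA_PHRU = "\u0F39"
# Base letters that should not carry U+0F39; only the one example from
# the message is listed here.
SUSPECT_BASES = {"\u0F45"}


def find_problems(text):
    """Return (index, label) pairs for characters that need attention."""
    problems = []
    for i, ch in enumerate(text):
        if ch in PRECOMPOSED:
            problems.append((i, "precomposed"))
        if ch in SUSPECT_BASES:
            # Scan the marks attached to this base letter. Marks with a
            # non-zero combining class may be reordered by normalization,
            # so U+0F39 need not directly follow the base.
            j = i + 1
            while j < len(text) and unicodedata.combining(text[j]) != 0:
                if text[j] == TSA_PHRU:
                    problems.append((j, "tsa-phru"))
                    break
                j += 1
    return problems


print(find_problems("\u0F45\u0F39\u0F0B\u0F40\u0F7B"))
```

    The scan stops at the first character with combining class zero,
    so a U+0F39 that follows a subjoined letter is not reported by
    this simplified version.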

    Thank you for raising these important issues.

    Kind regards,
    Robert

    ------- End of forwarded message -------


