From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu Jan 09 2003 - 08:15:57 EST
I'm forwarding this off-line email from Robert as I think it raises some
important issues about Tibetan encoding.
------- Start of forwarded message -------
From: "Robert R. Chilton" <acip@well.com>
Date: Wed, 08 Jan 2003 23:29:13 -0500
Cc: cfynn@gmx.net
Subject: Re: PRC asking for 956 precomposed Tibetan characters
To: "Andrew C. West" <andrewcwest@alumni.princeton.edu>
Andrew,
It should be mentioned that there are several normalization forms; I've
been referring, more or less, to "Normalization Form D" (NFD), which is
the form needed by processes that do searching and sorting. NFD is
essentially (at least for Tibetan) the maximum canonical decomposition
of characters.
> 1. I've encoded glyphs with subjoined HA as the precomposed characters U+0F43
> etc. rather than decomposing them to U+0F42, U+0FB7 etc. Is this the correct
> normalized form ?
No. Normalization Form D applies canonical decomposition (indicated by
the three-bar identity sign in the code charts) to characters. Thus, all
these precomposed characters with subjoined HA need to be decomposed.
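
A minimal sketch of the point above, assuming Python and its standard
unicodedata module (neither appears in the original exchange). The helper
names to_nfd and code_points are made up for illustration. It applies NFD
to the five precomposed letters with subjoined HA and prints the
resulting code points.

```python
import unicodedata

# Precomposed Tibetan letters with subjoined HA: GHA, DDHA, DHA, BHA, DZHA.
SUBJOINED_HA = ["\u0F43", "\u0F4D", "\u0F52", "\u0F57", "\u0F5C"]


def to_nfd(text: str) -> str:
    """Return text in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for ch in SUBJOINED_HA:
    # For example, U+0F43 is printed as "U+0F43 -> U+0F42 U+0FB7".
    print(code_points(ch), "->", code_points(to_nfd(ch)))
```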
> 2. The use of the precomposed long vowels with a-chung, U+0F73, U+0F75, U+0F77
> and U+0F79 is "discouraged" or "strongly discouraged" in the Unicode code
> charts, and so I have decomposed them to U+0F72, U+0F71 etc. Is this correct ?
Yes.
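
As a rough illustration (again assuming Python's unicodedata module, and
with helper names invented for this sketch), the four discouraged long
vowels can be compared under NFD and NFKD. In the Unicode Character
Database, U+0F73 and U+0F75 have canonical decompositions, whereas U+0F77
and U+0F79 carry compatibility decompositions only, so NFD leaves the
latter two unchanged. Note also that canonical ordering places U+0F71
(combining class 129) before U+0F72 (combining class 130), whichever
order was typed.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def nfd(text: str) -> str:
    """Canonical decomposition only."""
    return unicodedata.normalize("NFD", text)


def nfkd(text: str) -> str:
    """Canonical plus compatibility decomposition."""
    return unicodedata.normalize("NFKD", text)


for ch in ("\u0F73", "\u0F75", "\u0F77", "\u0F79"):
    print(code_points(ch), "NFD:", code_points(nfd(ch)),
          "| NFKD:", code_points(nfkd(ch)))

# Canonical ordering: <U+0F72, U+0F71> is reordered to <U+0F71, U+0F72>.
print(code_points(nfd("\u0F72\u0F71")))
```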
> 3. I've decomposed U+0F76 and U+0F78 to U+0FB2, U+0F80 and U+0FB3, U+0F80
> respectively. I'm not at all sure that it is correct to decompose these
> characters - what is your opinion ? And if I should not decompose U+0F76 and
> U+0F78, then should U+0F77 be decomposed to U+0F76, U+0F71, and U+0F79
> decomposed to U+0F78, U+0F71, even though no such equivalence is given in the
> Unicode code charts.
I think for the present purpose it is wise to decompose *all*
precomposed characters to their maximum decomposition. In this regard I
will contradict my earlier position regarding the triple (and double)
vowels E and O. The reason for my vacillation on this point is that
there is no canonical decomposition specified in the Unicode standard
for these two characters (U+0F7B and U+0F7D). Upon reflection, however,
I believe that a mistake was made in this regard and that these two
characters should have been recognized (for data processing purposes) as
*precomposed* characters and, further, that they should have been
deprecated with canonical decomposition to <U+0F7A, U+0F7A> and <U+0F7C,
U+0F7C>.
Unless I hear a good argument to the contrary, I will modify my own
collation tables and other materials so as to treat U+0F7B and U+0F7D as
precomposed characters that should be decomposed.
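
One way such a treatment might be sketched, assuming Python and its
unicodedata module: apply the two proposed mappings first, then ordinary
NFD. The table PROPOSED_EXTRA and the function max_decompose are
hypothetical names. The mappings for U+0F7B and U+0F7D are the proposal
made in this message, not decompositions defined by the Unicode Standard.

```python
import unicodedata

# Proposed in this message only; NOT part of the Unicode Standard, which
# defines no decomposition for U+0F7B (EE) or U+0F7D (OO).
PROPOSED_EXTRA = {
    0x0F7B: "\u0F7A\u0F7A",
    0x0F7D: "\u0F7C\u0F7C",
}


def max_decompose(text: str) -> str:
    """Apply the proposed extra mappings, then standard NFD."""
    return unicodedata.normalize("NFD", text.translate(PROPOSED_EXTRA))


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# KA + EE becomes KA + E + E under the proposed treatment.
print(code_points(max_decompose("\u0F40\u0F7B")))
```

Data prepared this way is no longer in standard NFD, so the extra step is
probably best kept to private search and collation keys rather than
applied to stored or interchanged text.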
As I see it (and this applies also to double vowels E and O), the only
purpose for *any* of the precomposed characters in the Tibetan block is
to facilitate using the Tibetan script for representing and processing a
language other than Tibetan (e.g., Sanskrit). Here I am speaking with
regard to the level of encoding and not of glyph rendering /
presentation forms! Also, I am not speaking here of Indic
transliteration orthographies that are found in abundance in Tibetan
materials but rather of usages where the material in question is clearly
Indic and follows Indic rules of collation, etc. (Prime examples would
be: a Sanskrit-Tibetan dictionary written in Tibetan script but sorted
according to Sanskrit collation order or a full-text Sanskrit document
written out in Tibetan letters.)
Thus, for virtually all intents and purposes, *none* of the precomposed
Tibetan characters should be used (including U+0F00, U+0F7B and
U+0F7D). On a slightly different subject, processes should be on guard
against the use of, e.g., U+0F39 with U+0F45 et al., since such usage
would result in various data-processing problems, including apparently
incorrect searching and sorting.
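
A guard of this kind might look like the following sketch, in Python. The
names SUSPECT_BASES and find_suspect_pairs are hypothetical. Only U+0F45
is listed because it is the only base letter named here; any further
letters are an assumption the caller would have to supply. The check
looks only at directly adjacent pairs and does not handle other
combining marks that may sit between the base letter and U+0F39.

```python
# U+0F39 TIBETAN MARK TSA -PHRU.
TSA_PHRU = "\u0F39"

# Base letters to flag when followed by U+0F39. Only U+0F45 is named in
# the message; extend this set at your own discretion.
SUSPECT_BASES = frozenset({"\u0F45"})


def find_suspect_pairs(text: str, bases=SUSPECT_BASES) -> list:
    """Return the index of each base letter directly followed by U+0F39."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i] in bases and text[i + 1] == TSA_PHRU
    ]


sample = "\u0F40\u0F45\u0F39\u0F0B"
for i in find_suspect_pairs(sample):
    print(f"suspect sequence at index {i}: "
          f"U+{ord(sample[i]):04X} U+{ord(sample[i + 1]):04X}")
```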
Thank you for raising these important issues.
Kind regards,
Robert
------- End of forwarded message -------