Fwd: Re: PRC asking for 956 precomposed Tibetan characters

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Tue Jan 07 2003 - 05:50:04 EST

------- Start of forwarded message -------

From: "Robert R. Chilton" <acip@well.com>
Date: Tue, 07 Jan 2003 00:20:01 -0500
Cc: unicode@unicode.org, tibex@unicode.org
Subject: Re: PRC asking for 956 precomposed Tibetan characters
To: "Andrew C. West" <andrewcwest@alumni.princeton.edu>

Andrew C. West wrote:
>
> On Mon, 06 Jan 2003 01:46:44 -0800 (PST), "Robert R. Chilton" wrote:
>
> ...
>
> > Such cases of triple (or quadruple) vowels E or O are best normalized to
> > double vowel plus single (or double) vowel to aid in collation and other
> > character data processing functions. Thus, Glyph 107 is best encoded as
> > (or normalized to) <U+0F41, U+0FB1, U+0F7B, U+0F7A>.
> >
>
> My rationale for not normalising to double vowel plus single (or double) vowel
> is that a double vowel sign used to indicate a shorthand abbreviation is
> fundamentally different from a double vowel used to represent a long vowel. For
> instance, when the phrase "ki ki swo swo" is abbreviated to "Ka + double I" and
> "Swa + double O" the double I and double O vowels represent the contraction of
> two I syllables and O syllables respectively, and not a long I and long O vowel
> respectively. As there is no character for a double I vowel sign, then the
> double I vowel must needs be encoded as two consecutive I vowels. Although
> there is a double O vowel sign (U+0F7D), I think that encoding it in the same manner
> as the double I, as two consecutive O vowels, would be more consistent than
> encoding it with the graphically identical but semantically different double O
> vowel. By encoding it as two consecutive O vowels it is making an explicit
> statement that this is a shorthand abbreviation and not simply a long O.
> As to shorthand abbreviations with three or four identical vowel signs, what is
> the advantage of normalising to "vowel + double vowel" or "double vowel +
> double vowel" other than saving a few bytes ? I don't see how this would aid collation
> or other character data processing functions. Given that KHYA + triple E could
> legitimately be encoded as <U+0F41, U+0FB1, U+0F7B, U+0F7A>, <U+0F41, U+0FB1,
> U+0F7A, U+0F7B> or <U+0F41, U+0FB1, U+0F7A, U+0F7A, U+0F7A>, a good Tibetan
> font would have to map all three sequences to the same glyph. And from a collation
> point of view, why is any one of these sequences more helpful than another ?
> All three sequences would be collated after <U+0F41, U+0FB1, U+0F7A>. Admittedly
> only <U+0F41, U+0FB1, U+0F7B, U+0F7A> might be collated after <U+0F41, U+0FB1,
> U+0F7B>, but then as KHYEEE probably represents an abbreviation for KHYE KHYE
> KHYE, should it not be collated after KHYE rather than KHYEE ?
> In short, I believe that it is useful to encode shorthand abbreviations as a
> sequence of individual vowels so as to distinguish them from graphically
> identical long vowel syllables, and to make explicit their function as
> shorthand abbreviations.
> Nevertheless, I'm not terribly fussed about this, and am happy to follow the
> consensus of opinion.

I understand your interest in preserving the semantic or lexical
distinction between an instance of a contracted series of single vowels
and a true usage of the double vowel. However, the procedure of
normalization is designed to collapse all the variant encodings for a
particular presentation form into a single, "normalized" encoding. Take
for example the Sanskrit vowel long-r which is romanized as a letter r
with a macron over and a dot under. Without normalization this
presentation form could be encoded either as r+macron+dot-under or as
r+dot-under+macron. A problem comes in data processing (searching,
sorting, etc.) in that what appears on screen (the presentation form)
is identical for both encodings yet a search for "r+macron+dot-under"
will not find instances of "r+dot-under+macron".
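
The long-r case can be tried out with a minimal Python sketch, assuming
the standard-library unicodedata module. The variable names and the
sample string are hypothetical and serve only as illustration.

```python
import unicodedata

MACRON = "\u0304"     # COMBINING MACRON
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

# Two encodings of the same presentation form (long vocalic r).
a = "r" + MACRON + DOT_BELOW
b = "r" + DOT_BELOW + MACRON

# The raw code point sequences differ, so a naive comparison fails.
raw_match = a == b

# After normalization to NFD both collapse to one sequence.
nfd_a = unicodedata.normalize("NFD", a)
nfd_b = unicodedata.normalize("NFD", b)
nfd_match = nfd_a == nfd_b

# Hypothetical sample text that happens to use encoding b.
text = "k" + b + "ta"
found_raw = a in text
found_nfd = nfd_a in unicodedata.normalize("NFD", text)

print(raw_match, nfd_match, found_raw, found_nfd)
```

Normalizing both the search key and the searched text before comparing
is what makes the second search succeed.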

Canonical combining classes are defined for combining characters (such
as macron and dot-under, or the vowel signs of Tibetan) in order to
support normalization of identical presentation forms to a single
encoding. So in the cases you cite, of "graphically identical but
semantically different" instances, consistency in searching, sorting,
etc. requires that all "graphically identical" presentation forms be
normalized to a single normalized encoding.
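
As a rough sketch of how combining classes drive this, the reordering
step can be written in a few lines of Python. The function name
canonical_reorder is hypothetical, the sketch covers only the
reordering step (not decomposition), and it assumes the combining
classes reported by the standard-library unicodedata module.

```python
import unicodedata

def canonical_reorder(s):
    """Stable-sort each run of combining marks (class > 0) by class.

    Characters with class 0 are never moved and end the current run.
    """
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # A class-0 character closes the pending run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# HA + vowel sign U + vowel sign AA, typed in the "wrong" order.
ha_u_aa = "\u0f67\u0f74\u0f71"
reordered = canonical_reorder(ha_u_aa)

# For already-decomposed input this should agree with NFD.
agrees_with_nfd = reordered == unicodedata.normalize("NFD", ha_u_aa)
print([f"U+{ord(c):04X}" for c in reordered], agrees_with_nfd)
```

Because the sort is stable, marks that share a combining class keep
their original relative order; only marks of different classes are
rearranged.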

At the risk of adding further confusion, perhaps it is useful to mention
at this point that there are two errors in the assignment of canonical
combining class to characters in the Tibetan block: TIBETAN SIGN RJES
SU NGA RO [U+0F7E] and TIBETAN MARK HALANTA [U+0F84]. These two
characters should have been assigned a combining class high enough to
cause them to be normalized to a position following any vowel signs.

The erroneous combining class of 0 (zero) assigned to TIBETAN SIGN RJES
SU NGA RO [U+0F7E] is particularly troublesome since RJES SU NGA RO
[U+0F7E] is closely related to, and in some cases interchangeable with,
TIBETAN SIGN NYI ZLA NAA DA [U+0F82] and TIBETAN SIGN SNA LDAN
[U+0F83]--these latter two being assigned a (correct) combining class of
230.
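
The assigned classes can be inspected directly. The following is a
small sketch assuming Python's standard-library unicodedata module; the
values printed come from whatever version of the Unicode Character
Database the interpreter ships with, and the dictionary keys are just
shortened labels.

```python
import unicodedata

signs = {
    "RJES SU NGA RO": "\u0f7e",
    "NYI ZLA NAA DA": "\u0f82",
    "SNA LDAN": "\u0f83",
    "HALANTA": "\u0f84",
}

# Canonical combining class of each sign, as reported by unicodedata.
classes = {name: unicodedata.combining(ch) for name, ch in signs.items()}

# Class of TIBETAN VOWEL SIGN AA, for comparison with the signs above.
vowel_aa_class = unicodedata.combining("\u0f71")

for name, ccc in classes.items():
    print(f"{name}: {ccc}")
print("VOWEL SIGN AA:", vowel_aa_class)
```

A sign can only be normalized to a position after the vowel signs if
its class is greater than theirs; a class of 0 means it is never
reordered at all.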

As demonstrated in the table below, although the various instances of
Tibetan syllable HUUNG [H'Um according to ACIP romanization] written
using the TIBETAN SIGN SNA LDAN will normalize to a single sequence, the
same cannot be said for the various instances of syllable HUUNG written
using the TIBETAN SIGN RJES SU NGA RO.

Variant encodings of HUUNG       Normalization Form D             Status
<U+0F67,U+0F71,U+0F74,U+0F83>    <U+0F67,U+0F71,U+0F74,U+0F83>    OK
<U+0F67,U+0F83,U+0F75>           <U+0F67,U+0F71,U+0F74,U+0F83>    OK
<U+0F67,U+0F83,U+0F74,U+0F71>    <U+0F67,U+0F71,U+0F74,U+0F83>    OK

<U+0F67,U+0F71,U+0F74,U+0F7E>    <U+0F67,U+0F71,U+0F74,U+0F7E>    OK
<U+0F67,U+0F7E,U+0F75>           <U+0F67,U+0F7E,U+0F71,U+0F74>    PROBLEM
<U+0F67,U+0F7E,U+0F74,U+0F71>    <U+0F67,U+0F7E,U+0F71,U+0F74>    PROBLEM

[Please refer to Unicode Technical Report #15: Unicode Normalization
Forms for more information on normalization.]
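
The six HUUNG variants listed above can be checked with a short Python
sketch, assuming the standard-library unicodedata module. The helper
name fmt is hypothetical, and the output reflects the Unicode data
bundled with the interpreter.

```python
import unicodedata

def fmt(s):
    """Render a string as <U+XXXX,U+XXXX,...> for display."""
    return "<" + ",".join(f"U+{ord(c):04X}" for c in s) + ">"

variants = [
    "\u0f67\u0f71\u0f74\u0f83",  # HA, AA, U, SNA LDAN
    "\u0f67\u0f83\u0f75",        # HA, SNA LDAN, UU
    "\u0f67\u0f83\u0f74\u0f71",  # HA, SNA LDAN, U, AA
    "\u0f67\u0f71\u0f74\u0f7e",  # HA, AA, U, RJES SU NGA RO
    "\u0f67\u0f7e\u0f75",        # HA, RJES SU NGA RO, UU
    "\u0f67\u0f7e\u0f74\u0f71",  # HA, RJES SU NGA RO, U, AA
]

# Normalization Form D of each variant.
results = [unicodedata.normalize("NFD", v) for v in variants]

for v, n in zip(variants, results):
    print(fmt(v), "->", fmt(n))

# Number of distinct normalized sequences in each group of three.
sna_ldan_forms = len(set(results[:3]))
nga_ro_forms = len(set(results[3:]))
print(sna_ldan_forms, nga_ro_forms)
```

The SNA LDAN group collapses to one normalized sequence, while the
RJES SU NGA RO group is left with two.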

I hope this is helpful.

Kind regards,
Robert Chilton

------- End of forwarded message -------

This archive was generated by hypermail 2.1.5 : Tue Jan 07 2003 - 06:41:07 EST