Re: Tibetan/Burmese/Khmer

From: Maurice J Bauhahn
Date: Fri Jan 17 1997 - 06:38:55 EST

The Cambodian script problem is similar to those of the other Indic scripts
discussed on the Unicode mailing list. It is evident that positions have
at times taken on a religious flavor. I trust that these discussions will
produce the following results, in what I feel is the proper long-term order
of priority: (1) data in the script will be searchable and sortable at
maximum speed (and with minimal ambiguity), (2) data entry speed will be
maximized (a 'natural' typing order and a minimal but complete character
set), (3) it will be economically feasible for the script to be used in
future software, and (4) display will be of the highest quality while
remaining easy to read (probably a non-issue for Unicode, since we are
separating display from encoding).

The implications of some of these other discussions for Khmer are
presented below.

>>Is the implication that Unicode/ISCII-type encoding will suit Burmese?
>I certainly think so. The only issue is whether vowels should be stored
>in phonetic order as in Devanagari or in visual order as in Thai. As far
>as the UI goes, teachers want students to type in the same order that
>they learn to write, which is visual. This means that if the backing
>store is phonetic, implementations need to provide an input method to
>reorder from visual to phonetic.

It is interesting that Khmer writing order is _largely_ phonetic, even
when that requires backing up to write a glyph before the base consonant.
(The exception appears to be when TWO out-of-phonetic-order glyphs precede
a base consonant: the first of the two is then written in visual order and
the second in phonetic order. When there is no (initial) subjoined 'rho'
[the second out-of-phonetic-order glyph mentioned above], that same vowel
[of which there are three different possibilities, written in visual order
above] would instead be written in phonetic order.) Phonetic ordering of
Khmer is consistent with the spelling order, which is also phonetic.
Apparently this contrasts with Thai(?). I was surprised to see that in the
supposedly undeletable (!!) Unicode 2.0, 0E70, 0E71, 0E72, 0E73, and 0E74
of the Unicode 1.0 Thai encoding were eliminated (these were called
phonetic order clones of left side vowel signs). Visual order encoding must
lengthen sorting time, since strings have to be reordered before they can
be compared. In Khmer this is particularly acute because three different
glyphs that surround a base consonant may make up one vowel character.
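
To make the sorting cost concrete, here is a minimal Python sketch using Thai, whose left-side vowels (U+0E40 to U+0E44) are stored in visual order. The function name `visual_to_phonetic` and the two sample strings are my own illustrations, and raw code point comparison stands in for a real collation table, so treat this as a sketch of the extra reordering pass rather than a Thai sorting implementation.

```python
# Thai left-side vowels (SARA E .. SARA AI MAIMALAI) and consonants (KO KAI .. HO NOKHUK).
THAI_PREVOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}
THAI_CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}


def visual_to_phonetic(text: str) -> str:
    """Move each left-side vowel after the consonant it is written before."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in THAI_PREVOWELS and nxt in THAI_CONSONANTS:
            # Stored as vowel + consonant (visual); emit consonant + vowel (phonetic).
            out.append(nxt)
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Two illustrative strings: SARA E + KO KAI, and KHO KHAI + SARA AA.
words = ["\u0e40\u0e01", "\u0e02\u0e32"]

# Comparing the stored (visual) order puts the left-side vowel first in the key.
by_code_point = sorted(words)

# Comparing after the reordering pass keys each string on its consonant.
by_phonetic = sorted(words, key=visual_to_phonetic)
```

The two sorts disagree on these strings, and the phonetic one needs an extra pass over every string before any comparison can happen.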

> Lee Collins said:
> >The same argument could be made for Burmese, however, I spent 2 weeks in
> >Burma working with Burmese standards, language experts, and school
> >teachers to design a good UI for Burmese typing. I started with a
> >prototype of Burmese on the Mac that used the virama. All new users
> >actually preferred the virama implementation to having to learn
> >additional key positions for subscripts.
> Is the implication that Unicode/ISCII-type encoding will suit Burmese?

In a similar vein, however, I have made a dead-key based keyboard for
Khmer which places all subscripts on the same keys as their corresponding
base consonant (no need to learn additional key positions).

If Khmer were mapped onto an ISCII encoding, several consonants would
need to be added out of alphabetic order, along with some vowels. For speed
of sorting that would be disadvantageous...but if economic considerations
made it necessary, Cambodians would have to put up with the

>>>I would argue that the subscripts complicate the sorting.
> >You can generate the proper subscript weight with a virama.
> Of course sorting in Tibetan is a nightmare, being dependent on the root of
> the syllable, not, strictly speaking, on the order of the characters in the
> data store.

One _can_ give proper weight to subscripts in this way...however, at
additional cost in time. I envision helping Cambodians to sort millions of
strings on their computers, and am fearful of the implications of numerous
compromises that reduce the efficiency of sorting. Sorting in Khmer is
similarly dependent on the root of the syllable (thank you, Michael, for
putting it in a way that makes it more understandable to the uninitiated).
In Khmer there are five different weightings within a syllable: base
consonant (or implied glottal stop consonant), first subscript consonant,
second subscript consonant, vowel, and sign. It will be nice with Unicode
to combine all the vowel glyph combinations into one character!
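
The five weightings can be sketched as a five-part sort key. The Python below is only an illustration under stated assumptions: it uses the Khmer block of present-day Unicode (consonants U+1780 to U+17A2, dependent vowels U+17B6 to U+17C5, signs U+17C6 to U+17D1, and the COENG subscript marker U+17D2), none of which existed when this was written; the names `SyllableKey` and `syllable_key` are hypothetical; input is assumed to be already split into syllables; and raw code points stand in for real collation weights, including for the implied glottal stop.

```python
from typing import NamedTuple

COENG = "\u17d2"  # subscript marker in present-day Unicode (assumption, see above)


class SyllableKey(NamedTuple):
    base: int   # base consonant (or independent vowel, as a stand-in)
    sub1: int   # first subscript consonant, 0 if absent
    sub2: int   # second subscript consonant, 0 if absent
    vowel: int  # dependent vowel, 0 if absent
    sign: int   # sign, 0 if absent


def is_dependent_vowel(ch: str) -> bool:
    return "\u17b6" <= ch <= "\u17c5"


def is_sign(ch: str) -> bool:
    return "\u17c6" <= ch <= "\u17d1"


def syllable_key(syllable: str) -> SyllableKey:
    """Build a five-level key for one non-empty, pre-segmented syllable."""
    base = ord(syllable[0])
    subs, vowel, sign = [], 0, 0
    i = 1
    while i < len(syllable):
        ch = syllable[i]
        if ch == COENG and i + 1 < len(syllable):
            subs.append(ord(syllable[i + 1]))  # consonant after COENG is a subscript
            i += 2
            continue
        if is_dependent_vowel(ch) and not vowel:
            vowel = ord(ch)
        elif is_sign(ch) and not sign:
            sign = ord(ch)
        i += 1
    subs = (subs + [0, 0])[:2]  # pad or trim to exactly two subscript slots
    return SyllableKey(base, subs[0], subs[1], vowel, sign)


# KHA; KA + subscript RO; KA + vowel AA; bare KA
words = ["\u1781", "\u1780\u17d2\u179a", "\u1780\u17b6", "\u1780"]
ordered = sorted(words, key=syllable_key)
```

Because tuples compare field by field, the base consonant decides first, then the subscripts, then the vowel, then the sign, which is the priority order described above.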

Rich McGowan mentions:

> the Unicode encoding also has the advantage of delineating stack
> boundaries nicely without "n-character look-around".

Please elaborate on what this means. I understand it to mean that if
characters are encoded phonetically then (with Khmer) the stack (consonant
cluster) would always begin with a consonant or an independent vowel.
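
One possible reading of that remark, sketched in Python with Tibetan as the example: since subjoined consonants and vowel signs are encoded as their own combining characters, each character by itself says whether it starts a stack. The function name `split_stacks` is hypothetical, and using the general categories from Python's `unicodedata` module to spot combining marks is my assumption about how an implementation might classify characters.

```python
import unicodedata


def split_stacks(text: str) -> list[str]:
    """Split text into stacks by looking at one character at a time."""
    stacks: list[str] = []
    for ch in text:
        # Combining marks (subjoined letters, vowel signs) extend the current stack.
        if stacks and unicodedata.category(ch) in ("Mn", "Mc"):
            stacks[-1] += ch
        else:
            # Any other character opens a new stack; no neighbours are inspected.
            stacks.append(ch)
    return stacks


# BA, SA + subjoined GA + subjoined RA + vowel sign U, BA, SA
word = "\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66"
stacks = split_stacks(word)
```

With a virama-style encoding, by contrast, a consonant could only be classified as a stack start or a subscript after checking whether a virama came before it.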

Mike Forgey asks and Rich responds:

>> Are the Tibetan subjoined characters considered to be equal to the
>> nominal form preceded by VIRAMA; i.e., 0F90 = 0F84 + 0F40?
>Uh, probably the answer is "NO". Don't encode with virama unless you
>mean to provide something that has an "abnormal" spelling for some
>specific effect.

Hence in a Khmer encoding (even if the subjoined characters were encoded
separately), abnormal subscripts could still be created by encoding a
VIRAMA and base character combination? This would be limited to the
independent vowels and the one consonant that does not have a subscript
form. Rarely, oh so rarely, these turn up as subscripts in ancient
manuscripts.

Maurice Bauhahn
