Re: Tibetan/Burmese/Khmer

From: Martin J. Duerst
Date: Mon Jan 20 1997 - 06:03:54 EST

On Sat, 18 Jan 1997 Maurice Bauhahn wrote:

> Thank you Michael for the information you passed on. Thank you for
> disclosing at what stage the Thai encoding was changed.

I have recently come up with a hypothesis that could explain some
of the basic workings of Thai that favor the "glyph-based" encoding
now in Unicode/ISO 10646: The fact that Thai is an isolating language,
with no declensions or conjugations, could mean that as long as the
syllable has a well-defined encoding,

> I wish I could calculate the theoretical limits to settle that question.
> All I know is the difficulty which I have experienced in creating a
> sorting algorithm for the language. There are five levels of dependencies
> with up to 35 members in each dependency. Of course the real language does
> not have all combinations but the variations are enough that a simple
> dictionary lookup does not seem practical.

Do the things you call "levels" work similarly to the following levels
used in sorting Latin text:
- Base letters
- Accents
- Case
I.e. you only start to consider accents if two words are completely
equal with respect to base letters. Likewise, would you only start to
compare subjoined consonants if the two words are identical with
respect to plain consonants?
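
To make the question concrete, here is a minimal sketch in Python of the
kind of multi-level comparison I mean for Latin. The function name
multilevel_key and the sample words are only illustrative assumptions,
and the sketch is not meant to be a complete collation algorithm. It
builds one key per level, so that a later level only matters when all
earlier levels are equal.

```python
import unicodedata


def multilevel_key(word):
    """Build a three-level sort key: base letters, then accents, then case."""
    # Decompose so that accented letters become base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", word)

    base = []   # level 1: case-folded base letters
    marks = []  # level 2: combining marks attached to each base letter
    case = []   # level 3: True where the base letter is upper case

    for ch in decomposed:
        if unicodedata.combining(ch):
            # Attach the mark to the preceding base letter.
            # A mark with no preceding base letter is ignored in this sketch.
            if marks:
                marks[-1] += ch
            continue
        base.append(ch.casefold())
        marks.append("")
        case.append(ch.isupper())

    # Tuples compare element by element, so level 2 is only consulted
    # when level 1 is equal, and level 3 only when levels 1 and 2 are equal.
    return ("".join(base), tuple(marks), tuple(case))


words = ["r\u00e9sum\u00e9", "Resume", "resume", "rest"]
ordered = sorted(words, key=multilevel_key)
```

With this key, "rest" sorts first because it differs at the base-letter
level, and the three words that share the base letters "resume" are then
separated by accents first and by case last.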

> I would love to do that, but have no idea how to incorporate Khmer script
> examples into HTML without a bunch of little GIFs! Are there any working
> browsers which take advantage of the proposed RFC2070
> (Internationalization of the Hypertext Markup Language)?

As one of the authors of RFC 2070, I would be very happy to offer a neat
solution. But it's a chicken-and-egg problem. You cannot discuss encoding
of a script and already assume an encoding. So please use inline bitmaps,
aka GIFs. This is actually suggested in RFC 2070, at the end of section
2.2 :-).

> > >In Khmer there are five different
> > >weightings within a syllable: base consonant (or implied glottal stop
> > >consonant), first subscript consonant, second subscript consonant, vowel,
> > >and sign. It will be nice with Unicode to combine all the vowel glyph
> > >combinations into one character!

For a single syllable, this should work anyway, because it will contain
the codepoints in the order you give. Having separate codepoints for
subjoined consonants or having them as virama+consonant will not change

> > None of this sounds like "root" in the sense in which Tibetan uses the term.
> Please post a URL to a document which describes what 'root' does mean when
> referring to Tibetan.

I'm not an expert in Tibetan, but to give you a very rough idea,
take English words like "know", "knife", "psyche", etc. Here,
"n" or "s" would be the root, not "k" or "p". In Tibetan, consonants
before the root can change how the root is pronounced, or may
themselves be pronounced in some dialects or have been in earlier times.
There are grammatical rules to find out which letter is the
root, but they are quite complex.

Regards, Martin.
