Khmer Subscripts: encode directly or no

From: Maurice J Bauhahn
Date: Fri Aug 15 1997 - 08:57:45 EDT

It would be good to summarize some of the issues facing Khmer encoding of
subscripts so as to come to an informed decision:

At first glance one would say: encode subscripts directly (rather than as
a coupling code followed by the original consonant). In fact I originally
held that viewpoint...for around three years. There are hidden issues,
however, which have made direct encoding unattractive. A summary of the
strengths and weaknesses of the two approaches is as follows:



Direct (one code) encoding, strengths:

More intuitive to Khmers

Each encoded subscript could also be the displayed subscript (most of the
  time), giving relative simplicity and reduced display intelligence

Not like ISCII;-)


Direct (one code) encoding, weaknesses:

Takes many more codes (exceeding the ability of 7 bits [128 code positions]
  to express; standards bodies dislike this). Would not work well with
  Apple WorldScript/translation to Unicode.

Still not adequate to display KHMER LETTER RAW as a subscript (nor several
  lesser used subscripts); hence an absolutely simple display is still
  impossible...display intelligence is still needed

Makes it difficult to have a comprehensive encoding of all subscripts
  (such as independent vowels used as subscripts)
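
The code count argument can be made concrete with a little arithmetic. The
sketch below is only illustrative: the function names and the repertoire
sizes passed in are placeholders, not figures from any Khmer proposal. It
assumes direct encoding needs one extra code for every character that can
appear as a subscript, whereas two code encoding needs a single extra
coupling code, and it ignores any positions a 7 bit set reserves for
controls.

```python
SEVEN_BIT_CODES = 2 ** 7  # 128 code positions in a 7 bit set


def codes_needed_direct(base_chars: int, subscriptable: int) -> int:
    """One code per base character plus one per distinct subscript form."""
    return base_chars + subscriptable


def codes_needed_two_code(base_chars: int) -> int:
    """One code per base character plus a single coupling code."""
    return base_chars + 1


def fits_seven_bits(count: int) -> bool:
    """True if the repertoire fits in a 7 bit code set."""
    return count <= SEVEN_BIT_CODES


# Placeholder sizes (assumptions for illustration only).
base, subs = 100, 45
direct = codes_needed_direct(base, subs)
two_code = codes_needed_two_code(base)
print(direct, fits_seven_bits(direct))      # 145 False
print(two_code, fits_seven_bits(two_code))  # 101 True
```

With these placeholder sizes the direct scheme overflows 7 bits while the
two code scheme does not; real repertoire counts would shift the numbers
but not the shape of the comparison.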



Two code encoding, strengths:

Allows standard encoding of rare, unusual subscripts without knowing all
  of them in advance (necessary for historical Khmer text and
  transliteration of Indian languages)

Easy portability between 7 bit encoding (Apple WorldScript) and Unicode

Uses fewer coding spaces...keeps standards bodies happy.


Two code encoding, weaknesses:

Requires display intelligence to display every subscript (it has to be
  there in any case, so why not use it?)

Syntax of Unicode requires that an escape code follow the character it
  modifies (however, ISCII VIRAMA, which has semantics similar to Khmer's,
  has passed muster; the code placement is the same whether we think of it
  as coming before the subscript or after the preceding consonant/subscript)
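
A minimal sketch of what two code encoding looks like in a data stream may
help here. It assumes the coupling code is U+17D2 (the code point Unicode
now names KHMER SIGN COENG) and uses U+1780 and U+179A as sample consonants;
these code points and the helper names encode_cluster and split_clusters are
illustrative assumptions, not part of this message's proposal. Note that the
stored sequence is identical whether the coupling code is read as preceding
the subscript or as following the consonant before it.

```python
COENG = "\u17D2"  # assumed coupling code (KHMER SIGN COENG in Unicode today)
KA = "\u1780"     # sample base consonant
RO = "\u179A"     # sample consonant used as a subscript


def encode_cluster(base: str, subscripts: list[str]) -> str:
    """Encode a base consonant followed by zero or more subscripts.

    Each subscript is stored as the coupling code plus the ordinary
    consonant code, so no dedicated subscript codes are needed.
    """
    return base + "".join(COENG + s for s in subscripts)


def split_clusters(text: str) -> list[str]:
    """Group each coupling code and the character after it with the
    preceding cluster (the minimal 'display intelligence' step)."""
    clusters: list[str] = []
    i = 0
    while i < len(text):
        if text[i] == COENG and clusters and i + 1 < len(text):
            clusters[-1] += text[i:i + 2]
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters


cluster = encode_cluster(KA, [RO])
print([f"U+{ord(c):04X}" for c in cluster])  # ['U+1780', 'U+17D2', 'U+179A']
print(split_clusters(cluster + KA) == [cluster, KA])  # True
```

Any character can follow the coupling code, which is why rare or
historical subscripts need no codes of their own under this scheme.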

In sum, two code encoding of subscripts allows Khmer to be encoded more
thoroughly than direct (one code) encoding of subscripts. To me the
preferable decision is obvious: two code encoding.

Maurice Bauhahn

