L2/02-083
Response and Proposal for Khmer Encoding
By: Paul Nelson, 2 December 2001
During the 41st WG2 Meeting in Singapore, the
Cambodian government submitted an official objection to the existing Khmer
block in Unicode 3.0 and proposed a new encoding to replace the existing Khmer
block. This paper is intended to provide comments and a point of view for
discussion in response to the Cambodian documents and the Everson/Bauhahn
documents.
Response
- The existing Khmer characters (consonants, independent vowels, dependent vowel
signs and other signs) must remain as they are currently encoded in
Unicode. This is critical to maintain order in the standard. While the
current encoding may not be ideal, it has been published nonetheless. It
is acknowledged that there may not have been official signoff by
representatives of the Cambodian government on the standard when it was
created a couple of years ago. History cannot be changed at this point. We
can only hope that there will be open Cambodian participation in the
Khmer encoding process moving forward.
- Characters proposed by the Cambodian delegation that are not included in the
Khmer Unicode block should be added if they do not duplicate characters that
can be generated by the current Unicode encoding mechanism. See the
proposal below.
- The COENG encoding model should not be considered synonymous with the
"virama" model. The COENG encoding model does not encompass all of the
behaviors that the virama has for Indic languages. Therefore, it is
suggested that the term "COENG encoding model" be used when speaking of
Khmer script usage. Any wording or semantics referring to the virama
should be removed from the Unicode standard when discussing Khmer script.
- The existing COENG encoding model should be maintained for the scope in which
it provides a consistent and workable solution.
- It is acknowledged that the COENG encoding model requires additional
storage space for documents.
- It is acknowledged that the COENG encoding model may have slightly slower
performance for sorting and rendering text.
- It is acknowledged that some people view the COENG encoding model as a
foreign convention that is being forced on users of the modern Khmer
language. It is also acknowledged that some people may view the COENG model
as a cultural assault on them as Cambodians.
- After reflecting on the issue and weighing the pros and cons, it seems that
items a. and b. above are not significant enough issues to require a
change to the COENG encoding model. The costs of moving from the COENG
encoding model to an encoded subscript model are: 1) the necessity of
invalidating all existing Khmer Unicode data and implementations, 2) the
necessity of deprecating the COENG character, 3) the necessity of adding all
of the subscript characters, and 4) most critically, the introduction of a
destabilizing factor into the ISO and Unicode standards, because others
would view this as a precedent to change other areas as well.
- The COENG + KA combination is exactly the same as explicitly encoding the
COENG KA subscript form. In any place where the proposed replacement encoding
represents and uses the encoded subscript form of a letter, it is equivalent
to use the existing Khmer Unicode standard to represent the subscript by
COENG + the base character from which that subscript form is derived.
i. The COENG model requires the following consonant or independent vowel to be
"glued" to the COENG and be treated as a unit from that time on. Places where
this "glued" combination is required include, but are not limited to,
rendering, collation/sorting, determining caret position, and copying and
pasting text.
ii. The COENG letter combination functions as a diacritic or combining mark to
the base character.
- The existing COENG encoding model does not correctly handle lunar dates. I
propose that the Lunar Date Symbols (LDS) proposed by the Cambodian
delegation be encoded in a new Extended Khmer block to be located near the
range of U+19E0 - U+19FF. This addition of Lunar Date Symbols is needed
because lunar dates behave differently from consonants; that is, the LDS
cannot be used within the definition of the COENG encoding model.
- The COENG encoding model says that the vowel of the preceding consonant is
killed. This does not apply to lunar dates.
- The COENG encoding model says that the following letter should be treated as
a subscript. The formation of the lunar date has the second number in a
subscript form. However, the preceding number is reduced in size and placed
in a superscript form.
- Lunar date symbols may have one or two digits above, or one or two digits
below. Having more than one digit in either position causes the COENG
encoding model to not work correctly. Therefore, the COENG model should
not be construed as also working for lunar dates.
- For the COENG encoding model to handle the ROBAT, as contended by Bauhahn, an
exception to the definition of the COENG encoding model (4.b. above) is
required for this special case. This introduces an alternative manner in
which the ROBAT is encoded and causes issues with normalization and
canonical ordering. While the point is taken that encoding the ROBAT using
the COENG encoding model in the order suggested by Bauhahn solves sorting
issues, it must be pointed out that encoding and sorting are two
completely different concepts, and collation should not be improved or
fixed by suggesting changes to the repertoire. Thus, if encoding the ROBAT
as Bauhahn suggests is seriously considered, 1) the ROBAT character should
be deprecated so that only one method of forming the ROBAT remains and 2)
an exception to the COENG encoding model must be introduced. Input from
the Cambodian delegation is critical to correctly understanding this
issue.
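The "glued" behavior described in the response (COENG plus the following consonant or independent vowel treated as one unit that attaches to the base character) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the letter range used (U+1780..U+17B3 for Khmer consonants and independent vowels) is a simplifying assumption; dependent vowel signs, ROBAT and other signs are deliberately not handled.

```python
COENG = "\u17D2"  # KHMER SIGN COENG


def is_khmer_letter(ch: str) -> bool:
    """True for Khmer consonants and independent vowels (assumed U+1780..U+17B3)."""
    return 0x1780 <= ord(ch) <= 0x17B3


def glue_coeng(text: str) -> list:
    """Split text into units, keeping each COENG + letter pair in one unit.

    Hypothetical helper: the pair is appended to the preceding unit, so it
    behaves like a combining mark on whatever precedes it.
    """
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        has_letter_after = i + 1 < len(text) and is_khmer_letter(text[i + 1])
        if ch == COENG and has_letter_after:
            pair = ch + text[i + 1]
            if units:
                units[-1] += pair      # attach to the preceding unit
            else:
                units.append(pair)     # nothing precedes; keep the pair whole
            i += 2
        else:
            units.append(ch)
            i += 1
    return units


# KA, COENG, KHA, KO: the COENG + KHA pair stays with KA; KO is its own unit.
sample = "\u1780\u17D2\u1781\u1782"
units = glue_coeng(sample)
```

Operations such as caret movement or copy and paste would then step over `units` rather than over individual code points.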
Proposed Encoding Changes
The following charts include characters that should be added
to Unicode to support Khmer.
Abstract: In the process of originally encoding the
Khmer script, some commonly used characters were not encoded. It is proposed
that the characters listed be added to the current Khmer block to allow modern
Khmer documents to be created. The characters added are grouped into six areas.
- Additional Diacritic Signs
- Repeater Sign
- Divination Lore Signs
- Pali/Sanskrit extending sign
- Variant Selector: The variant selector is used to resolve ambiguous cases
where the same letter may take different shapes.
- Lunar Date Symbols

Additional Diacritic Signs
17DD KHMER SIGN ATTHACAN;Mn;0;NSM;;;;;N;;;;;
Repeater Sign
17DE KHMER SIGN LEKTO;Po;0;L;;;;;N;;;;;
Digit symbols for divination lore
17F0 KHMER SYMBOL LEK ATTAK SON;Nd;0;L;;0;0;0;N;;;;;
17F1 KHMER SYMBOL LEK ATTAK MUOY;Nd;0;L;;0;0;0;N;;;;;
17F2 KHMER SYMBOL LEK ATTAK PII;Nd;0;L;;0;0;0;N;;;;;
17F3 KHMER SYMBOL LEK ATTAK BEI;Nd;0;L;;0;0;0;N;;;;;
17F4 KHMER SYMBOL LEK ATTAK BUON;Nd;0;L;;0;0;0;N;;;;;
17F5 KHMER SYMBOL LEK ATTAK PRAM;Nd;0;L;;0;0;0;N;;;;;
17F6 KHMER SYMBOL LEK ATTAK PRAM-MUOY;Nd;0;L;;0;0;0;N;;;;;
17F7 KHMER SYMBOL LEK ATTAK PRAM-PII;Nd;0;L;;0;0;0;N;;;;;
17F8 KHMER SYMBOL LEK ATTAK PRAM-BEI;Nd;0;L;;0;0;0;N;;;;;
17F9 KHMER SYMBOL LEK ATTAK PRAM-BUON;Nd;0;L;;0;0;0;N;;;;;
Pali/Sanskrit extending sign
17FA KHMER SIGN AVAKRAHA;Po;0;L;;;;;N;;;;;
Control Character
17FF KHMER VARIANT SIGN;Cf;0;BN;;;;;N;;;;;
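The property records listed above appear to follow the semicolon-separated field order of UnicodeData.txt, with the code point written in front of the character name. Under that assumption, a record can be read with a minimal Python sketch such as the following; the `Record` type and `parse_record` function are hypothetical names introduced for illustration, and only a few of the fields are kept.

```python
from typing import NamedTuple


class Record(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    numeric_value: str  # left as text; empty when the record gives no value


def parse_record(line: str) -> Record:
    """Parse one 'CODE NAME;gc;ccc;bidi;...' line (assumed UnicodeData.txt order)."""
    code, rest = line.split(None, 1)       # code point, then everything else
    fields = rest.strip().split(";")
    if len(fields) != 14:                  # name plus 13 property fields
        raise ValueError(f"expected 14 fields, got {len(fields)}")
    return Record(
        code_point=int(code, 16),
        name=fields[0].strip(),
        general_category=fields[1],
        combining_class=int(fields[2]),
        bidi_class=fields[3],
        numeric_value=fields[7],
    )


atthacan = parse_record("17DD KHMER SIGN ATTHACAN;Mn;0;NSM;;;;;N;;;;;")
lek_son = parse_record("17F0 KHMER SYMBOL LEK ATTAK SON;Nd;0;L;;0;0;0;N;;;;;")
```

Parsing the records this way makes it easy to check that every proposed character carries the intended general category and bidirectional class before the data is submitted.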
Khmer Extended
The proposed Khmer Extended block includes lunar date
symbols that are used with Khmer. The proposed range is U+19E0 - U+19FF.
Sorting order: The sorting order of the Khmer Extended
block should be in the order of the Unicode characters. [this should be
confirmed or correct sort order given]
Typographical form of Khmer lunar dates: The typographical
form of Khmer lunar dates is a top and bottom section of the same size text.
The dividing line between the upper and lower half of the symbol is the vertical
center of the line height.

Lunar Date Symbols
19E0 � KHMER
SYMBOL PATHAMASAT;No;0;L;;0;0;0;N;;;;;
19E1 � KHMER
SYMBOL MUOY KOET;No;0;L;;0;0;0;N;;;;;
19E2 � KHMER
SYMBOL PII KOET;No;0;L;;0;0;0;N;;;;;
19E3 � KHMER
SYMBOL BEI KOET;No;0;L;;0;0;0;N;;;;;
19E4 � KHMER
SYMBOL BUON KOET;No;0;L;;0;0;0;N;;;;;
19E5 � KHMER
SYMBOL PRAM KOET;No;0;L;;0;0;0;N;;;;;
19E6 � KHMER
SYMBOL PRAM-MUOY KOET;No;0;L;;0;0;0;N;;;;;
19E7 � KHMER
SYMBOL PRAM-PII KOET;No;0;L;;0;0;0;N;;;;;
19E8 � KHMER
SYMBOL PRAM-BEI KOET;No;0;L;;0;0;0;N;;;;;
19E9 � KHMER
SYMBOL PRAM-BUON KOET;No;0;L;;0;0;0;N;;;;;
19EA � KHMER
SYMBOL DAP KOET;No;0;L;;0;0;0;N;;;;;
19EB � KHMER
SYMBOL DAP-MUOY KOET;No;0;L;;0;0;0;N;;;;;
19EC � KHMER
SYMBOL DAP-PII KOET;No;0;L;;0;0;0;N;;;;;
19ED � KHMER
SYMBOL DAP-BEI KOET;No;0;L;;0;0;0;N;;;;;
19EE � KHMER
SYMBOL DAP-BUON KOET;No;0;L;;0;0;0;N;;;;;
19EF � KHMER
SYMBOL DAP-PRAM KOET;No;0;L;;0;0;0;N;;;;;
19F0 � KHMER
SYMBOL TUTEYASAT;No;0;L;;0;0;0;N;;;;;
19F1 � KHMER SYMBOL MUOY ROC;No;0;L;;0;0;0;N;;;;;
19F2 � KHMER SYMBOL PII ROC;No;0;L;;0;0;0;N;;;;;
19F3 � KHMER SYMBOL BEI ROC;No;0;L;;0;0;0;N;;;;;
19F4 � KHMER SYMBOL BUON ROC;No;0;L;;0;0;0;N;;;;;
19F5 � KHMER SYMBOL PRAM ROC;No;0;L;;0;0;0;N;;;;;
19F6 � KHMER SYMBOL PRAM-MUOY ROC;No;0;L;;0;0;0;N;;;;;
19F7 � KHMER SYMBOL PRAM-PII ROC;No;0;L;;0;0;0;N;;;;;
19F8 � KHMER
SYMBOL PRAM-BEI ROC;No;0;L;;0;0;0;N;;;;;
19F9 � KHMER
SYMBOL PRAM-BUON ROC;No;0;L;;0;0;0;N;;;;;
19FA � KHMER
SYMBOL DAP ROC;No;0;L;;0;0;0;N;;;;;
19FB � KHMER
SYMBOL DAP-MUOY ROC;No;0;L;;0;0;0;N;;;;;
19FC � KHMER SYMBOL DAP-PII ROC;No;0;L;;0;0;0;N;;;;;
19FD � KHMER SYMBOL DAP-BEI ROC;No;0;L;;0;0;0;N;;;;;
19FE � KHMER SYMBOL DAP-BUON ROC;No;0;L;;0;0;0;N;;;;;
19FF � KHMER SYMBOL DAP-PRAM ROC;No;0;L;;0;0;0;N;;;;;
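The 32 lunar date symbol names follow a regular pattern: each half of the block opens with one specially named symbol (PATHAMASAT, TUTEYASAT) and continues with the numerals 1 to 15 followed by KOET or ROC. The following Python sketch, offered only as a cross-check of the listing and using a hypothetical function name, regenerates the names from that pattern; the numeral spellings are copied from the list above.

```python
# Numeral words 1..15 as they are spelled in the proposed character names.
NUMERALS = [
    "MUOY", "PII", "BEI", "BUON", "PRAM",
    "PRAM-MUOY", "PRAM-PII", "PRAM-BEI", "PRAM-BUON", "DAP",
    "DAP-MUOY", "DAP-PII", "DAP-BEI", "DAP-BUON", "DAP-PRAM",
]

# (name of the first symbol in the half, suffix used by the other 15 symbols)
HALVES = [("PATHAMASAT", "KOET"), ("TUTEYASAT", "ROC")]


def lunar_date_names(start: int = 0x19E0) -> dict:
    """Map each proposed code point to its name (hypothetical helper)."""
    names = {}
    for half_index, (first, suffix) in enumerate(HALVES):
        base = start + 16 * half_index          # 19E0 for KOET, 19F0 for ROC
        names[base] = f"KHMER SYMBOL {first}"
        for day, numeral in enumerate(NUMERALS, start=1):
            names[base + day] = f"KHMER SYMBOL {numeral} {suffix}"
    return names


names = lunar_date_names()
```

Comparing `names` against the parsed records is a quick way to catch a misspelled or misplaced entry in a table of this size.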
Compatibility Mappings
Compatibility mapping: [this section needs to be completed]
Are these considered as atomic units, or are they considered as being
compatible with some combination of numbers? This does not imply that they would
be decomposed or formed from decomposed forms. It does provide for some default
sorting behaviors.