L2/01-307

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, August 06, 2001 5:56 PM

Subject: Serious bug in Khmer, Myanmar combining classes

WARNING. The following presentation contains explicit
linguistic material that may not be suitable for young
audiences.

Peter Constable noted the problem that Myanmar dependent
vowels i (U+102D) and u (U+102F) both have been given combining
class zero, but that they must be used in sequence to represent
the common Myanmar vowel ui. However, since i is a combining
mark above and u is a combining mark below, this leads to a
visual ambiguity -- either order could, in principle, lead to
the same visual rendering, but the two orders would not be
canonically equivalent, since combining marks of class zero do
not rearrange under normalization.
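
A minimal sketch of that last point, assuming Python's standard
unicodedata module (its data comes from whatever Unicode version the
interpreter ships, so the values it reports are an assumption about
that version, not a statement about any particular release). The
helper name canonically_equivalent is made up for illustration.

```python
import unicodedata

KA, SIGN_I, SIGN_U = "\u1000", "\u102D", "\u102F"

def canonically_equivalent(a: str, b: str) -> bool:
    """Treat two strings as canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

u_then_i = KA + SIGN_U + SIGN_I
i_then_u = KA + SIGN_I + SIGN_U

# Both vowel signs report combining class 0, so normalization leaves
# each sequence in the order it was typed.
print(unicodedata.combining(SIGN_I), unicodedata.combining(SIGN_U))
print(canonically_equivalent(u_then_i, i_then_u))

# Contrast: Latin a with dot below (class 220) and acute (class 230),
# typed in either order, normalizes to the same sequence.
print(canonically_equivalent("a\u0323\u0301", "a\u0301\u0323"))
```

The Myanmar comparison should come out False and the Latin one True,
which is exactly the mismatch Peter noted.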

The short summary of the discussion which followed is:

Peter suggested changing the combining classes to non-zero values,
so the two sequences would be canonically equivalent (as would
be the case for a sequence of an accent above and an accent below a Latin
letter, for instance).

Mark said, we can't fix it -- it's against Unicode policy.

Rick said, it's an error, and the UTC should discuss it.

Mark said, the IETF and W3C would kill us if we tried to change the
combining classes. We can only document it.

Peter said implementations will end up having to do an ad hoc
kind of normalization, and that's a problem.

Here are some more gory details about Myanmar to scare the kiddies
with.

The basic pattern for Myanmar is as follows. The syllabic order
is ka kaa ki kii ku kuu ke kai ko koo kă kui. Spelled out in
Unicode, this is:

ka	1000
kaa	1000 102C
ki	1000 102D
kii	1000 102E
ku	1000 102F
kuu	1000 1030
ke	1000 1031
kai	1000 1032
ko	1000 1031 102C
koo	1000 1031 102C 1039
kă	1000 1036
kui	1000 102F 102D

The pattern is nice and clean except for ko, koo, and kui.

-o is a two-part vowel, with the -e piece to the left of the
consonant, and the -aa piece to the right of the consonant.
-oo adds a visual killer (1039, combining class 9) on top of
the -aa piece.

-ui is a two-part vowel, with the -u piece underneath the
consonant, and the -i piece on top of the consonant.

Alternatives were proposed and discussed ad nauseam before Myanmar
was finally agreed to. In particular, there were proposals that
had a single encoded character for each of the two-part vowels
(or some subset of them). The situation was complicated by the
mismatched pattern for the independent vowels, some of which are
written by cliticizing the dependent vowel to U+1021 'a' (which
behaves like an open syllable initial consonant placeholder --
or could be analyzed as a glottal stop, I suppose), and some of which
have distinct composite forms of their own.

In the end, things were horsetraded down to what we've got, and
like it or not, we are stuck with it now. (Note, for the record,
that the Myanmar participants agreed to the idea of encoding
-ui as a sequence of two characters, so this wasn't just something
foisted on them by glyph-oriented Westerners ignorant of the
vocalic pattern.)

Now consider the problem of "spelling" of the two-part vowels.
Peter points out the visual ambiguity. In principle, kui could
also be spelled 1000 102D 102F, instead of 1000 102F 102D, and
under an ordinary implementation, you wouldn't be able to tell
them apart. However, the problem is not so simple as above
and below pieces. If you look at the other two-part vowel, the
one with the left and right pieces, -o, the same ambiguity
exists, despite the fact that we are not talking about above
and below clitics. If one spelled ko 1000 102C 1031, a renderer
would still be faced with the problem that 1031 is defined as
rendering to the *left* of its consonant base, and 102C as
rendering to the right. One could argue that a dumb renderer
would end up positioning these as visually: [1000 1031 102C],
i.e. moving the 1031 around the 102C, but not around the consonant
ka, 1000, resulting in a visually incorrect display, so that
there would not be any visual ambiguity. But that just means that
a dumb renderer would get it wrong, whereas a renderer that checked
appropriately for the preceding consonant might, in fact, get
it right, resulting in visual ambiguity again.

In my opinion, in *both* of these instances, the right way
to proceed is to specify the correct order, and to characterize
the other order as a *spelling* error -- not as a canonicalization
error.

The way to eliminate the visual ambiguity in the -ui case is to
write a Myanmar renderer such that if it encounters the two
pieces of the -ui vowel in the wrong order, it displays them
visually wrong (intentionally), rather than quietly stacking
them as if they were spelled in the correct order. That will
give correct feedback for all of the potentially ambiguous
cases.
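
The same "spelling error" idea could also be checked outside the
renderer. What follows is only a rough sketch under stated
assumptions: the function name and the table are hypothetical, the
"correct" orders are the ones proposed in this message (102F before
102D for -ui, 1031 before 102C for -o), and a real checker would need
a full model of Myanmar syllable structure rather than a scan of
adjacent pairs.

```python
# Reversed orderings of the two-part vowels, treated here as spelling
# errors. Keys are (first, second) code points as they appear in text.
MISORDERED_PAIRS = {
    ("\u102D", "\u102F"): "-ui written as i + u",
    ("\u102C", "\u1031"): "-o written as aa + e",
}

def find_vowel_spelling_errors(text: str) -> list:
    """Return (index, description) for each reversed two-part vowel."""
    errors = []
    for i in range(len(text) - 1):
        reason = MISORDERED_PAIRS.get((text[i], text[i + 1]))
        if reason is not None:
            errors.append((i, reason))
    return errors

print(find_vowel_spelling_errors("\u1000\u102F\u102D"))  # kui, proposed order
print(find_vowel_spelling_errors("\u1000\u102D\u102F"))  # kui, reversed
```

A renderer or input method could use a check of this kind to decide
when to show the deliberately wrong display described above.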

Furthermore, one would expect that Myanmar input methods would
provide single-key access to all of the two-part vowels, in any
case, as for most Indic keyboarding systems. This will work to
help keep the -ui's and -o's correct in the underlying store.

Actually, the -ui vowel issue is not the worst of it. There, even
with everything assigned a combining class of zero, there is still
a consistent way to implement the behavior desired. There is
another issue where I think the combining classes *are* clearly
wrong, but still cannot be fixed.

The issue is for U+1037, the aukmyit dot below. In Myanmar, this
is a tone mark, *not* a nukta. But it was given the combining
class of a nukta, i.e. 7. By itself, that would cause no harm,
but the problem is that 1037 comes in a pattern pair with
U+1038 MYANMAR SIGN VISARGA, which also behaves, in Myanmar,
as a tone mark. Thus we get tonal triples of the sort:

ang1	1021 1004 1039  ( a -nga -killer )

ang2	1021 1004 1039 1037

ang3	1021 1004 1039 1038

This is the order that I think makes the most linguistic sense,
where the killer is applied to the nga to create a final -ng
consonant, and then the tone marks, if any, are in logical order
following the killer. Visually, the dot below appears below the
-ng, and the visarga, a colon-shaped double dot, appears to the
right of the -ng (with the killer above the -ng).

The problem is that the combining class of the killer is 9, as
for all other halants (viramas), whereas the combining class of
the 1037 dot below is 7, and the combining class of the 1038
visarga is zero. That means that the representation of ang2
is not in canonical order, which would instead be:

ang2	1021 1004 1037 1039

whereas the representation of ang3 *is* in canonical order.
This asymmetry of two otherwise parallel and very commonly
occurring forms under normalization is likely to create problems
for processing of Myanmar data.
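
To make the asymmetry concrete, here is a small sketch, assuming
Python's standard unicodedata module and the combining classes
described above (9 for 1039, 7 for 1037, 0 for 1038). The helper
hexes is made up for display purposes.

```python
import unicodedata

def hexes(s: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

ang2 = "\u1021\u1004\u1039\u1037"  # a, nga, killer, dot below
ang3 = "\u1021\u1004\u1039\u1038"  # a, nga, killer, visarga

for name, s in (("ang2", ang2), ("ang3", ang3)):
    classes = [unicodedata.combining(c) for c in s]
    nfd = unicodedata.normalize("NFD", s)
    # Show the input, its combining classes, and its normalized form.
    print(name, hexes(s), classes, "->", hexes(nfd))
```

Under those assumptions, normalization swaps the last two characters
of ang2 and leaves ang3 untouched.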

The alternative would be to specify that the correct spelling
of tone marks applied to consonant-final syllables is
to place the tone marks *before* the syllable-final killer:

ang2	1021 1004 1037 1039
	   0    0    7    9

ang3	1021 1004 1038 1039
	   0    0    0    9

In this way, despite the mismatch in combining classes for 1037 and
1038, both of these expressions would be in canonical order, which
would bode better for systematic processing, despite the somewhat
counterintuitive notion of putting the tone mark in between the
final consonant and its killer. (In particular, for ang3, the 1039
killer would have to rearrange around the 1038 visarga, so that
it correctly appeared on top of the 1004 nga.)
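
The claim that both alternative spellings are already in canonical
order can be checked directly from the combining classes. This is a
sketch only: in_canonical_order is a hypothetical helper, it relies on
Python's unicodedata for the class values, and it tests ordering
alone, not whether the string is fully decomposed.

```python
import unicodedata

def in_canonical_order(s: str) -> bool:
    """True if no mark is directly followed by a nonzero mark of lower class."""
    classes = [unicodedata.combining(c) for c in s]
    return not any(a > b > 0 for a, b in zip(classes, classes[1:]))

# Tone mark placed before the killer (the alternative spelling).
ang2_alt = "\u1021\u1004\u1037\u1039"  # classes 0 0 7 9
ang3_alt = "\u1021\u1004\u1038\u1039"  # classes 0 0 0 9

# Tone mark placed after the killer (the linguistically motivated order).
ang2_ling = "\u1021\u1004\u1039\u1037"  # classes 0 0 9 7

print(in_canonical_order(ang2_alt), in_canonical_order(ang3_alt))
print(in_canonical_order(ang2_ling))
```

Only the spelling with the dot below after the killer fails the check,
since 9 followed by 7 is the one out-of-order pair.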

What this is all pointing to, in my opinion, is that we are desperately
in need of implementation guidelines for Myanmar (and for Khmer) in
the same kind of detail as for Devanagari, so that these ordering
issues and ambiguities can be nailed down in sufficient detail to
enable a text model of properly spelled Myanmar (and Khmer). Otherwise,
we will not be able to interchange text successfully. Or at least,
while the text itself could be interchanged, it would be spelled
by drastically different conventions -- and since for Indic scripts,
the "spellings" involve complicated interactions with the rendering
rules, a spelling that works for Renderer A might result in
illegible gibberish in Renderer B, which was assuming different
spelling conventions. That would fail the Unicode plain text
criteria for interoperability.

All my ruminations on this topic are gladly contributed to the
cause, but I think it is imperative that someone who actually
has implementation experience with Myanmar in a real system
take the lead on this.

--Ken

