Re: Combining class for Thai characters

From: Peter_Constable@sil.org
Date: Thu May 23 2002 - 03:28:01 EDT


On 05/21/2002 10:07:32 AM Samphan Raruenrom wrote:

>I have something to consult with you about the properties of Thai
>characters in Unicode...

>The (below-attached) tone marks "MAI EK, THO, TRI, CHATTAWA" have combining
>class 107

That's "above-attached", of course (simply a typo).

>My first question is :-
>Why the above-attached vowel signs/marks all have combining class 0?

I'm not positive on the history, but here's my take: As you mention, there
is a sequencing constraint in WTT. In an earlier version of the Unicode
standard (prior to 2.1) all of the Thai characters of category Mn had
fixed-position classes. I'm guessing that that was influenced by a notion
of there needing to be a specific order, as in WTT. It didn't really
accomplish anything to have all the different fixed position classes,
though. If anything, it created some complications, which I won't
elaborate on. At any rate, between 2.0 and 3.0, a lot of fixed-position
classes, both for Thai and for other scripts, were simplified. In so
doing, many were set to 0.

My opinion is that they should have been simplified, but that setting the
bulk of them to 0 was a mistake and creates some significant problems
(which go a step beyond the questions you raise here). I think they should
have been simplified in line with the final suggestion you make: have
those that interact typographically have the same class. (I'd say the
same of many other combining marks in a number of other scripts.)

>This inhibits them from participating in normalizations, right?

Well, it's not clear what you mean by that. Having them set to combining
class 0 means that they do not re-order when performing canonical
ordering, and so they are already in canonical order, hence in normal form
(except that in NFKD and NFKC there is the compatibility decomposition of
sara am).
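
As a minimal sketch of the point above, assuming Python's standard
unicodedata module and the code points named in the comments (U+0E39,
U+0E35, U+0E48, U+0E33), the combining classes and the one compatibility
decomposition can be inspected directly:

```python
import unicodedata

# Code points assumed from the Thai block of the Unicode code charts.
MARKS = {
    "SARA UU": "\u0e39",  # below-attached vowel sign
    "SARA II": "\u0e35",  # above-attached vowel sign
    "MAI EK": "\u0e48",   # tone mark
}

# Canonical combining class of each mark, as reported by the Python build.
classes = {name: unicodedata.combining(ch) for name, ch in MARKS.items()}
for name, ccc in classes.items():
    print(f"{name}: combining class {ccc}")

# SARA AM has a compatibility decomposition but no canonical one, so it
# only changes under the "K" normalisation forms.
SARA_AM = "\u0e33"
nfd = unicodedata.normalize("NFD", SARA_AM)
nfkd = unicodedata.normalize("NFKD", SARA_AM)
print("NFD: ", [f"U+{ord(c):04X}" for c in nfd])
print("NFKD:", [f"U+{ord(c):04X}" for c in nfkd])
```

A class of 0 is what keeps SARA II out of canonical reordering; the two
non-zero classes are the only ones in this set that can be re-sorted.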

>Examples :-
>The sequences (both of which should look the same on non-WTT shaping engine) :-
>(1) KO KAI + SARA UU + MAI EK -> กู่ -> combining class = 0, 103, 107
>(2) KO KAI + MAI EK + SARA UU -> กู่ -> combining class = 0, 107, 103
>
>While Unicode doesn't have the notion of invalid sequence, Thai has one,
>defined by a national standard (WTT) to be (approximately) :
>CONSONANT + (above or below) VOWEL SIGN + TONE MARK or THANTHAKHAT
>
>The same concept occurs in, for example, Devanagari...

It's important to understand two things:

i) Just because a rule applies to the encoding of Devanagari in Unicode,
that does not mean the rule therefore necessarily applies to any other
script in Unicode.

ii) Just because a rule applies to the encoding of Thai in a legacy
encoding standard, that does not mean the rule therefore necessarily applies
to the encoding of Thai script in Unicode.

In spite of any sequencing constraints on Devanagari in Unicode or on Thai
in WTT, the two Unicode character sequences that you cited above are both
valid representations of the same thing. More precisely, they are by
definition canonically equivalent, and they have the same normalised
representations. Either can occur in data, and they should be rendered
identically, and in general processes should treat them as
indistinguishable. (That's slightly strong, since there are special
situations, e.g. in normalising, when a process should distinguish them.
The relevant conformance requirement is that no conformant process can
assume they are distinct.)
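
A process that needs to compare such strings can do so by normalising both
sides first. The following is a minimal sketch, assuming Python's standard
unicodedata module; the helper name canonically_equivalent is made up for
illustration, and the code points are those of KO KAI, SARA UU and MAI EK:

```python
import unicodedata

KO_KAI, SARA_UU, MAI_EK = "\u0e01", "\u0e39", "\u0e48"

seq1 = KO_KAI + SARA_UU + MAI_EK  # classes 0, 103, 107
seq2 = KO_KAI + MAI_EK + SARA_UU  # classes 0, 107, 103


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(seq1 == seq2)                        # code point comparison
print(canonically_equivalent(seq1, seq2))  # comparison after normalising
```

The raw comparison sees two different code point sequences; the normalised
comparison treats them as the same text.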

>So (correct me if I'm wrong) the notion of invalid sequence in Unicode is
>script-specific.

Yes, but be careful of misinterpreting combining classes as saying
anything about what is or isn't a valid sequence -- they say absolutely
nothing in that regard.

>And it is (is it?) intended that the normalized sequences should (as much as
>possible?) be correct for the particular scripts; otherwise, the normalized
>text will be rendered differently from the un-normalized text (do they have to?).

Your sentence has too many alternative readings for me to know how to
answer. Let me respond in reference to what I commented on above: the two
example sequences you gave are canonically equivalent, and should be
rendered the same. The first is in canonical order (hence in normal form
for any of NFC, NFD, NFKC, NFKD), while the second is not, but that is not
really relevant with regard to their rendering: both should be presented
the same way. It is *not* true that normalised text will necessarily be
rendered differently from non-normalised text.
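
To see which of the two sequences is already in normal form, one could
compare each with its own normalisation under all four forms. This is only
a sketch, assuming Python's standard unicodedata module and the code points
for KO KAI (U+0E01), SARA UU (U+0E39) and MAI EK (U+0E48):

```python
import unicodedata

seq1 = "\u0e01\u0e39\u0e48"  # KO KAI + SARA UU + MAI EK
seq2 = "\u0e01\u0e48\u0e39"  # KO KAI + MAI EK + SARA UU

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# For each form, record whether each sequence is unchanged by normalisation.
already_normal = {
    form: (
        unicodedata.normalize(form, seq1) == seq1,
        unicodedata.normalize(form, seq2) == seq2,
    )
    for form in FORMS
}

for form, (first, second) in already_normal.items():
    print(f"{form}: (1) unchanged={first}, (2) unchanged={second}")
```

Nothing here bears on rendering; it only reports which stored form a
normaliser would leave untouched.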

>This works for the above sequences, both (1) and (2) normalized to (1).
>But for the following sequences :-
>(3) KO KAI + SARA II + MAI EK -> กี่ -> combining class = 0, 0, 107
>(4) KO KAI + MAI EK + SARA II -> ก่ี -> combining class = 0, 107, 0
>
>They should both be normalized to (3) but not, because class 0 does not
>participate in reordering (they are both normalized).

I agree that no reordering occurs in canonical ordering because sara ii
has a class of 0, but I disagree that they *should* have the same
normalised representation. It seems to me you are making that assumption
because you are applying the lens of WTT, which is biased specifically in
relation to one particular language: Standard Thai. The script can be, and
is, used for writing other languages, and in principle another language
may have different requirements for combining mark combinations. I
personally think that mai ek and sara ii should have the *same* combining
class. But that's immaterial at this point since the fact is that they do
not, nor is UTC willing to change them so that they have the same
combining class.
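
The reason no reordering happens can be sketched with a hand-written version
of the canonical ordering step. This is an illustration only, not a
replacement for a real normaliser: it assumes Python's standard unicodedata
module for the class lookup, and canonical_reorder is a made-up name. Each
class-0 character ends the current run of marks, so only marks inside one
run are ever sorted against each other:

```python
import unicodedata


def canonical_reorder(text: str) -> str:
    """Stable-sort each run of non-zero-class marks by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character closes the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


KO_KAI, SARA_UU, SARA_II, MAI_EK = "\u0e01", "\u0e39", "\u0e35", "\u0e48"

seq1 = KO_KAI + SARA_UU + MAI_EK  # classes 0, 103, 107
seq2 = KO_KAI + MAI_EK + SARA_UU  # classes 0, 107, 103
seq3 = KO_KAI + SARA_II + MAI_EK  # classes 0, 0, 107
seq4 = KO_KAI + MAI_EK + SARA_II  # classes 0, 107, 0

print(canonical_reorder(seq2) == seq1)  # SARA UU and MAI EK share a run
print(canonical_reorder(seq4) == seq3)  # SARA II splits the run
```

In sequence (4) the tone mark sits alone in its run, with sara ii acting as
a barrier after it, so there is nothing for the sort to move.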

>It's possible to correct this by assigning above-attached vowel signs
>(i.e. SARA II) with combining class more than 0.

I'm assuming you mean to assign sara ii a combining class greater than 0 and
other than 107. I think that would be the wrong thing to do. But, that's also
immaterial since at this point, the stability requirements prohibit the
combining class of sara ii from being changed at all.

>Or, according to the Unicode (and Thai) convention that order below marks
>before above marks, the combining class of above vowels should be more than
>103 (below vowels) and less than 107 (tone marks, which always above-attached).

Neither a good idea, I think, nor possible.

>Or if it's intended that the above vowel and tone mark should be stacked
>according to the Unicode default inside-out rule, both should have the same
>combining class 107 to let them interact typographically.

That is exactly what I think *should* have been done. If I had my way,
we'd change it to that. But UTC will not make such a change at this point
due to a commitment not to alter normalised representations from version
3.0. We are stuck with the vowels that position above having combining
classes of 0, for better or worse.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


