Re: Fixed position combining classes (Was: Combining class for Thai characters)

From: Samphan Raruenrom (samphan@thai.com)
Date: Sun Jun 02 2002 - 06:40:05 EDT


Hi :)

Thank you for the invaluable reply, and sorry for my confusing English.
I'll try to be as clear as possible in the future. I'm not good at
English, especially at choosing the appropriate level (polite/aggressive)
of language for a particular meaning.
I'm learning about Unicode and love it very much. The problem is that
I only have experience with processing Thai, so all of my comments
are actually questions. Please add "correct me if I'm wrong" to all of
them.
I'll include related data from the Unicode website/book to make things
clear for the others in the discussion, whom you can see in the Cc list.
Please use Reply All so everyone will get it.

Peter_Constable@sil.org wrote:
> On 05/21/2002 10:07:32 AM Samphan Raruenrom wrote:
>>Why do the above-attached vowel signs/marks all have combining class 0?
> I'm not positive on the history, but here's my take: As you mention, there
> is a sequencing constraint in WTT. In an earlier version of the Unicode
> standard (prior to 2.1) all of the Thai characters of category Mn had
> fixed-position classes. I'm guessing that that was influenced by a notion
> of there needing to be a specific order, as in WTT.

This is what I guessed too.

>>So (correct me if I'm wrong) the notion of invalid sequence in Unicode
>>is script-specific.
> Yes, but be careful of misinterpreting combining classes as saying
> anything about what is or isn't a valid sequence -- they say
> absolutely nothing in that regard.

I see. I misunderstood that.

> It didn't really accomplish anything to have all the different fixed
> position classes, though. If anything, it created some complications,
> which I won't elaborate on.

Your answer led me to version 2.0.14 of UnicodeData, quoted below.

UnicodeData-2.0.14.txt
: 0E31;THAI CHARACTER MAI HAN-AKAT;Mn;98
: 0E34;THAI CHARACTER SARA I;Mn;99
: 0E35;THAI CHARACTER SARA II;Mn;100
: 0E36;THAI CHARACTER SARA UE;Mn;101
: 0E37;THAI CHARACTER SARA UEE;Mn;102
: 0E38;THAI CHARACTER SARA U;Mn;103
: 0E39;THAI CHARACTER SARA UU;Mn;104
: 0E3A;THAI CHARACTER PHINTHU;Mn;105
: 0E47;THAI CHARACTER MAITAIKHU;Mn;106
: 0E48;THAI CHARACTER MAI EK;Mn;107
: 0E49;THAI CHARACTER MAI THO;Mn;108
: 0E4A;THAI CHARACTER MAI TRI;Mn;109
: 0E4B;THAI CHARACTER MAI CHATTAWA;Mn;110
: 0E4C;THAI CHARACTER THANTHAKHAT;Mn;111
: 0E4D;THAI CHARACTER NIKHAHIT;Mn;112
: 0E4E;THAI CHARACTER YAMAKKAN;Mn;128

I agree that they should be simplified. All of the Mn characters are
simply assigned distinct, increasing values (note that none is 0).
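
As a quick check of that observation, here is a minimal Python sketch that
parses the lines quoted above (only the four fields shown) and tests that the
classes are non-zero, distinct, and increasing with the code point. The names
QUOTED_2_0_14 and parse are mine, for illustration only.

```python
# The UnicodeData-2.0.14 lines quoted above (first four fields only).
QUOTED_2_0_14 = """\
0E31;THAI CHARACTER MAI HAN-AKAT;Mn;98
0E34;THAI CHARACTER SARA I;Mn;99
0E35;THAI CHARACTER SARA II;Mn;100
0E36;THAI CHARACTER SARA UE;Mn;101
0E37;THAI CHARACTER SARA UEE;Mn;102
0E38;THAI CHARACTER SARA U;Mn;103
0E39;THAI CHARACTER SARA UU;Mn;104
0E3A;THAI CHARACTER PHINTHU;Mn;105
0E47;THAI CHARACTER MAITAIKHU;Mn;106
0E48;THAI CHARACTER MAI EK;Mn;107
0E49;THAI CHARACTER MAI THO;Mn;108
0E4A;THAI CHARACTER MAI TRI;Mn;109
0E4B;THAI CHARACTER MAI CHATTAWA;Mn;110
0E4C;THAI CHARACTER THANTHAKHAT;Mn;111
0E4D;THAI CHARACTER NIKHAHIT;Mn;112
0E4E;THAI CHARACTER YAMAKKAN;Mn;128
"""


def parse(text):
    """Return (code point, name, category, combining class) per line."""
    records = []
    for line in text.splitlines():
        code, name, category, ccc = line.split(";")[:4]
        records.append((int(code, 16), name, category, int(ccc)))
    return records


records = parse(QUOTED_2_0_14)
classes = [ccc for _, _, _, ccc in records]

print("none is 0: ", all(ccc != 0 for ccc in classes))
print("distinct:  ", len(set(classes)) == len(classes))
print("increasing:", classes == sorted(classes))
```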

> At any rate, between 2.0 and 3.0, a lot of fixed-position
> classes, both for Thai and for other scripts, were simplified. In so
> doing, many were set to 0.

http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html#Modification
History
: Unicode 2.1.8
: Changes to combining class values. Most Indic fixed position class
: non-spacing marks were changed to combining class 0. This fixes some
: inconsistencies in how canonical reordering would apply to Indic
: scripts, including Tibetan. Indic interacting top/bottom fixed position
: classes were merged into single (non-zero) classes as part of this
: change. Tibetan subjoined consonants are changed from combining class 6
: to combining class 0. Thai pinthu (U+0E3A) moved to combining class 9.
: Moved two Devanagari stress marks into generic above and below combining
: classes (U+0951, U+0952).

Let's talk about the idea behind combining classes. From "The Unicode
Standard 3.0" and the information from you, my impression is that:
(1) The reason for having combining classes comes from the different
possible ways to encode the same character. The same character must always
compare equal no matter how it is encoded, whether as a precomposed
character or through composition.
(2) The criterion for assigning combining classes is that the string
must be rendered the same before and after normalization. Text that
looks the same must always compare equal, regardless of the order of
(non-interacting) marks in the memory representation. For example,
BASE + ABOVE_MARK + BELOW_MARK = BASE + BELOW_MARK + ABOVE_MARK
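
To make the example in (2) concrete, here is a minimal sketch using Python's
standard unicodedata module. The choice of Latin "a" with COMBINING ACUTE
ACCENT (above, class 230) and COMBINING DOT BELOW (below, class 220) is my
own, picked only because these two marks do not interact.

```python
import unicodedata

BASE = "a"
ABOVE_MARK = "\u0301"  # COMBINING ACUTE ACCENT, combining class 230
BELOW_MARK = "\u0323"  # COMBINING DOT BELOW, combining class 220

s1 = BASE + ABOVE_MARK + BELOW_MARK
s2 = BASE + BELOW_MARK + ABOVE_MARK

# The raw code point sequences differ ...
print(s1 == s2)  # False

# ... but canonical decomposition (NFD) sorts the marks by combining
# class, so both strings end up as the same sequence.
nfd1 = unicodedata.normalize("NFD", s1)
nfd2 = unicodedata.normalize("NFD", s2)
print(nfd1 == nfd2)  # True
print([f"U+{ord(c):04X}" for c in nfd1])  # ['U+0061', 'U+0323', 'U+0301']
```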

At least for Indic scripts (which include Thai), the criteria before 2.1
seemed to ensure just (1) and entirely disregarded typographically
interacting marks. This could be accomplished without combining classes
at all: simply sorting the marks by code point would do.
To ensure (2), interacting marks must be assigned the same (non-zero)
combining class, as stated in the modification history (requoted below).
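
The reason the same class matters can be seen in a small sketch of canonical
reordering as I understand it: within each run of marks with non-zero class,
the marks are stable-sorted by class, so marks sharing a class keep their
typed order. The function name canonical_reorder is mine, it handles only
the reordering step (no decomposition), and the class values come from
whatever Unicode version Python bundles.

```python
import unicodedata


def canonical_reorder(text):
    """Stable-sort each run of non-zero-class marks by combining class."""
    out = []
    run = []  # pending marks whose combining class is non-zero
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the run; flush it in sorted order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


KO_KAI = "\u0E01"   # class 0 (base consonant)
SARA_U = "\u0E38"   # class 103
MAI_EK = "\u0E48"   # class 107
MAI_THO = "\u0E49"  # class 107

# Different classes: the lower class is moved first.
print(canonical_reorder(KO_KAI + MAI_EK + SARA_U) == KO_KAI + SARA_U + MAI_EK)

# Same class: the typed order is preserved, so the two orders stay distinct.
print(canonical_reorder(KO_KAI + MAI_THO + MAI_EK) == KO_KAI + MAI_THO + MAI_EK)
```

Because Python's sorted() is stable, two tone marks of class 107 are never
swapped, which is exactly what we want for marks that interact.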

   Note: Unlike the other classes, the relation between the different
        fixed position classes is not clear. All I know is that
        classes 10..199 are called fixed position classes.
        I can't find any details on that. Do you have any?

: Indic interacting top/bottom fixed position classes were merged into
: single (*non_zero*) classes as part of this change.

They really said that, but this is not what actually happened.
For Thai, it looks like all above vowels and above marks except the tone
marks were assigned class 0:

UnicodeData-3.0.2.txt
: 0E34;THAI CHARACTER SARA I;Mn;0
: 0E35;THAI CHARACTER SARA II;Mn;0
: 0E36;THAI CHARACTER SARA UE;Mn;0
: 0E37;THAI CHARACTER SARA UEE;Mn;0
: 0E38;THAI CHARACTER SARA U;Mn;103
: 0E39;THAI CHARACTER SARA UU;Mn;103
: 0E3A;THAI CHARACTER PHINTHU;Mn;9
: 0E47;THAI CHARACTER MAITAIKHU;Mn;0
: 0E48;THAI CHARACTER MAI EK;Mn;107
: 0E49;THAI CHARACTER MAI THO;Mn;107
: 0E4A;THAI CHARACTER MAI TRI;Mn;107
: 0E4B;THAI CHARACTER MAI CHATTAWA;Mn;107
: 0E4C;THAI CHARACTER THANTHAKHAT;Mn;0
: 0E4D;THAI CHARACTER NIKHAHIT;Mn;0
: 0E4E;THAI CHARACTER YAMAKKAN;Mn;0
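
For anyone who wants to check these values on their own machine, here is a
minimal sketch using Python's standard unicodedata module. It reports the
data of whatever Unicode version Python bundles (see
unicodedata.unidata_version), not the 3.0.2 file itself, so treat any
agreement with the listing above as a cross-check only. THAI_MARKS is my own
name for the list of code points discussed in this thread.

```python
import unicodedata

# U+0E31, U+0E34..U+0E3A and U+0E47..U+0E4E: the Thai Mn characters
# discussed in this thread.
THAI_MARKS = [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E4F)]

# Map each code point to its canonical combining class.
classes = {cp: unicodedata.combining(chr(cp)) for cp in THAI_MARKS}

print("Unicode data version:", unicodedata.unidata_version)
for cp in THAI_MARKS:
    ch = chr(cp)
    # Same field order as the UnicodeData lines quoted above.
    print(f"{cp:04X};{unicodedata.name(ch)};"
          f"{unicodedata.category(ch)};{classes[cp]}")
```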

I can't find any reason for that. Above marks should all be in the
fixed position classes (10..199). This gives me the impression that
the job has not been finished yet.

> My opinion is that they should have been simplified, but that setting the
> bulk of them to 0 was a mistake and creates some significant problems
> (which go a step beyond the questions you raise here).

Can you elaborate on this?

> I think they should
> have been simplified in line with the final suggestion you make: have
> those that interact typographically have the same class. (I'd say the
> same of many other combining marks in a number of other scripts.)

I see that this is not just a problem for Thai.

> I personally think that mai eek and sara ii should have the *same*
> combining class. But that's immaterial at this point since the fact
> is that they do not, nor is UTC willing to change them so that they
> have the same combining class.
> ... at this point, the stability requirements prohibit the
> combining class of sara ii from being changed at all...
> due to a commitment not to alter normalised representations from version
> 3.0. We are stuck with the vowels that position above having combining
> classes of 0, for better or worse.

IMO, it would be best if we could change that. But apart from that, it
is still more useful to note what is right or wrong than to say nothing
about it. After all, this happens to other (Indic) scripts too, right?
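
To show what class 0 on the above vowels means in practice, here is a small
sketch, again assuming Python's standard unicodedata module. Since sara ii
has class 0, it blocks reordering, and the two possible orders of sara ii
and mai ek remain different strings even after normalization. The variable
names are mine.

```python
import unicodedata

KO_KAI = "\u0E01"   # base consonant
SARA_II = "\u0E35"  # above vowel, combining class 0
MAI_EK = "\u0E48"   # tone mark, combining class 107

vowel_first = KO_KAI + SARA_II + MAI_EK
tone_first = KO_KAI + MAI_EK + SARA_II

# Normalization leaves both sequences as typed, because a class-0
# character is never moved past another mark.
same = (unicodedata.normalize("NFC", vowel_first)
        == unicodedata.normalize("NFC", tone_first))
print(same)  # False
```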

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html


