Re: Fixed position combining classes

From: Samphan Raruenrom (samphan@thai.com)
Date: Thu Jun 06 2002 - 10:53:35 EDT


Peter_Constable@sil.org wrote:
> On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:
>>>My opinion is that they should have been simplified, but that setting the
>>>bulk of them to 0 was a mistake and creates some significant problems
>>>(which go a step beyond the questions you raise here).
>>Can you elaborate on this?
> Given the characters
> : 0E35;THAI CHARACTER SARA II;Mn;0
> : 0E39;THAI CHARACTER SARA UU;Mn;103
> consider the sequences
> < 0e35, 0e39 > vs. < 0e39, 0e35 >
> I'm guessing your first reaction will be to say that these cannot co-occur.

No, not at all :) I have already learned from you to be more open-minded
about this kind of Unicode issue.

> That is true for the Thai language, but may not be true for other languages
> written with Thai script.

I've read a book on the history of Thai characters and found that many
vowels have changed position over time. So this issue is more
understandable to me now.

> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.
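
A minimal Python sketch of the comparison described in the quoted text. It
assumes the standard unicodedata module carries the combining classes listed
above (0 for U+0E35, 103 for U+0E39); the helper name equal_under is only
illustrative.

```python
import unicodedata

SARA_II = "\u0e35"  # THAI CHARACTER SARA II, combining class 0
SARA_UU = "\u0e39"  # THAI CHARACTER SARA UU, combining class 103

seq1 = SARA_II + SARA_UU  # < 0e35, 0e39 >
seq2 = SARA_UU + SARA_II  # < 0e39, 0e35 >

# Combining classes as reported by the local Unicode database.
classes = {f"U+{ord(c):04X}": unicodedata.combining(c) for c in seq1}

def equal_under(form, a, b):
    """Compare two strings after normalising both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Check all four normalisation forms named in the quoted message.
results = {form: equal_under(form, seq1, seq2)
           for form in ("NFD", "NFC", "NFKD", "NFKC")}

print(classes)
print(results)
```

Under that assumption every entry in results is False: a class-0 mark acts
as a starter, so neither order is rearranged into the other.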

Let's talk about something that really happens in Thai.

1)

0E01;THAI CHARACTER KO KAI;Lo;0
0E38;THAI CHARACTER SARA U;Mn;103
0E4D;THAI CHARACTER NIKHAHIT;Mn;0

The sequences (which occur in Pali transcription)

(a) KO KAI + SARA U + NIKHAHIT
(b) KO KAI + NIKHAHIT + SARA U

They look the same but do not compare equal: because the combining class
of NIKHAHIT happens to be 0, both are already in normalized form.
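
The mechanism can be sketched in Python. The function canonical_order below
is a simplified illustration of canonical reordering (swap adjacent
characters when the first has a higher combining class than the second and
the second is nonzero), not a full normalizer. The hypothetical_ccc table is
an assumption made purely for illustration: it pretends NIKHAHIT had class
230, which is not what the standard assigns.

```python
import unicodedata

def canonical_order(text, ccc=unicodedata.combining):
    """Swap adjacent characters while ccc(first) > ccc(second) > 0."""
    chars = list(text)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            first, second = ccc(chars[i]), ccc(chars[i + 1])
            if first > second > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

KO_KAI, SARA_U, NIKHAHIT = "\u0e01", "\u0e38", "\u0e4d"

seq_a = KO_KAI + SARA_U + NIKHAHIT  # (a)
seq_b = KO_KAI + NIKHAHIT + SARA_U  # (b)

# With the real classes (NIKHAHIT = 0) neither sequence is reordered.
a_unchanged = canonical_order(seq_a) == seq_a
b_unchanged = canonical_order(seq_b) == seq_b

def hypothetical_ccc(ch):
    """HYPOTHETICAL classes: NIKHAHIT is given 230 for illustration only."""
    return 230 if ch == NIKHAHIT else unicodedata.combining(ch)

# Under the hypothetical class, (b) is reordered into (a).
b_becomes_a = canonical_order(seq_b, hypothetical_ccc) == seq_a

print(a_unchanged, b_unchanged, b_becomes_a)
```

With the actual class of 0, NIKHAHIT blocks reordering, so (a) and (b) stay
distinct; only under the made-up nonzero class would they collapse to one
form.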

2)

0E32;THAI CHARACTER SARA AA;Lo;0
0E48;THAI CHARACTER MAI EK;Mn;107
0E33;THAI CHARACTER SARA AM;Lo;0;L;<compat> "NIKHAHIT" "SARA AA"

There are two ways to represent the word KO KAI + MAI EK + SARA AM:

(a) KO KAI + MAI EK + SARA AM
(b) KO KAI + NIKHAHIT + MAI EK + SARA AA

(b) must be in this sequence to get the intended look for
the word (not that this is the valid sequence for Thai/WTT).
That is, the mai-ek is on top of the nikhahit.

The problem is with the NFKD/NFKC of (a), which is

(c) KO KAI + MAI EK + NIKHAHIT + SARA AA

This will be rendered with the nikhahit on top of the mai-ek, which is
not the same as (a) and is not the intended look. So the string changes
its shape after normalization. Is this a violation of any principle?

The problem also comes from the fact that the combining class of
NIKHAHIT is 0, which makes reordering of (c) impossible.
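
A small Python sketch that reproduces case 2, assuming the unicodedata
module has the compatibility decomposition of SARA AM listed above. The
variable names are illustrative only.

```python
import unicodedata

KO_KAI, MAI_EK, SARA_AM = "\u0e01", "\u0e48", "\u0e33"
NIKHAHIT, SARA_AA = "\u0e4d", "\u0e32"

seq_a = KO_KAI + MAI_EK + SARA_AM             # (a)
seq_b = KO_KAI + NIKHAHIT + MAI_EK + SARA_AA  # (b)
seq_c = KO_KAI + MAI_EK + NIKHAHIT + SARA_AA  # (c)

def names(text):
    """Character names, to make the resulting order easy to read."""
    return [unicodedata.name(ch) for ch in text]

# Compatibility normalisation replaces SARA AM with NIKHAHIT + SARA AA.
nfkd_a = unicodedata.normalize("NFKD", seq_a)
nfkc_a = unicodedata.normalize("NFKC", seq_a)

# (b) contains no decomposable characters and no reorderable marks.
nfkd_b = unicodedata.normalize("NFKD", seq_b)

print(unicodedata.decomposition(SARA_AM))
print(names(nfkd_a))
print(nfkd_a == seq_c, nfkd_a == nfkd_b)
```

Under that assumption NFKD and NFKC of (a) both give (c), with MAI EK before
NIKHAHIT, while (b) is left as it is, so (a) and (b) still do not match
after normalization.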

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html


