Re: Is UniCode's Thai character representation is acceptable by TISI or not?

From: Samphan Raruenrom (samphan@thai.com)
Date: Tue Jul 16 2002 - 23:12:31 EDT


Dear Mark,

Thanks for the informative reply. :)

Mark Davis wrote:
> Some comments below.
> ----- Original Message -----
> From: "Samphan Raruenrom" <samphan@thai.com>
> To: "Asmus Freytag" <asmusf@ix.netcom.com>
> Cc: "Sreedhar M" <sreedhar@cmcltd.com>; <unicode@unicode.org>; "Rick McGowan" <rick@unicode.org>
> Sent: Tuesday, July 16, 2002 07:22
> Subject: Re: Is UniCode's Thai character representation is acceptable by TISI or not?
>>Asmus Freytag wrote:
>>>At 12:06 PM 7/16/02 +0700, Samphan Raruenrom wrote:
>>Problems from Unicode properties
>>- errors in the combining classes of vowel signs make normalization worthless
>> in some cases. This is important if you want to compare strings.
> Meaning: the normalized forms of two strings are not equal in cases
> where Thais would consider them equal, right?

Definitely.

>>- decomposition of SARA AM adds more problems to normalization
> I don't recall seeing that note; I'll look forward to your report.

Please see my discussion with khun Peter Constable quoted below.

>>- some properties make grapheme clusters for Thai
>> incompatible with what Thais expect, e.g. PHINTHU as
>> virama, SARA AM not a combining character
> In the last UTC, action was taken that is not yet in the draft TR on
> boundaries. In particular, this affects Thai.

Glad to hear that :)

>>Inaccuracy in the Unicode book
>>- backspace 'always' uses the same (grapheme cluster) character boundary
>> as Del and the left/right arrows. Actually, Thai uses backspace to delete a single
>> character, not the whole cluster. So the character boundary for backspace should
>> be locale-specific.
> This text will be overridden by the TR.

Great!
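
To illustrate the distinction between the two deletion behaviours, here is a
minimal Python sketch. It approximates a grapheme cluster as a base character
followed by nonspacing marks (general category Mn), which is a simplification
of the real boundary rules, and the function names are hypothetical.

```python
import unicodedata


def delete_cluster(text: str) -> str:
    """Remove the last base character together with its trailing marks."""
    i = len(text)
    # Skip back over trailing nonspacing marks (general category Mn).
    while i > 0 and unicodedata.category(text[i - 1]) == "Mn":
        i -= 1
    # Then remove the base character itself, if there is one.
    return text[:max(i - 1, 0)]


def backspace_thai(text: str) -> str:
    """Remove only the last code point, as Thai users expect of backspace."""
    return text[:-1]


word = "\u0e01\u0e35\u0e48"  # KO KAI + SARA II + MAI EK
```

With this sketch, delete_cluster(word) removes the whole syllable, while
backspace_thai(word) removes only the MAI EK.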

>>- in Thai, zero width space is said to be able to expand in a fully justified
>> paragraph. Actually it is always zero width.
> There may be some misunderstanding here. What is meant is: if you had
> the sequence ABCD, and between the B and the C was a zero-width space,
> AND you were inter-character spacing for justification, you would not
> expect to see:
> A BC D
> Instead, you would expect to see
> A B C D
> That is, the zero-width space does not prevent the characters from
> using inter-character spacing.

Sorry for misunderstanding that. A short explanation/example like this in
the book (chapter 9) would help a lot.
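
Mark's ABCD example can be sketched in Python as follows. This is only an
illustration of the idea, not how any real layout engine works: the function
name is hypothetical, and a plain space stands in for the extra
inter-character spacing.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def letterspace(text: str, gap: str = " ") -> str:
    """Insert a gap between every pair of visible characters.

    The zero-width space is kept in the output, but it neither adds a gap
    of its own nor suppresses the gap between its neighbours.
    """
    out = []
    seen_visible = False
    for ch in text:
        if ch == ZWSP:
            out.append(ch)  # kept, contributes no width and no gap
            continue
        if seen_visible:
            out.append(gap)
        out.append(ch)
        seen_visible = True
    return "".join(out)


spaced = letterspace("AB" + ZWSP + "CD")
```

Ignoring the invisible ZWSP, the result reads "A B C D" rather than "A BC D".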

>>These are things you have to know after learning the Unicode standard
>>if you plan to work with the Thai language, to 'code around' the problems
>>and make the result acceptable to Thai people.
>>I plan to write a formal report on the issue, not to change the standard,
>>but to note what is wrong and what has to be coded around. So people
>>who would like to work with the Thai language (like you) will know the right thing
>>to do and not repeat the same mistakes as in some software.

-------- Original Message --------
Subject: Re: Fixed position combining classes
Date: Thu, 06 Jun 2002 21:53:35 +0700
From: Samphan Raruenrom <samphan@thai.com>
Organization: NECTEC
To: Peter_Constable@sil.org
CC: Arthit Suriyawongkul <Arthit.Suriyawongkul@sun.com>, Suwit Srivilairith <suwits@th.ibm.com>, Thai IT Standards Newsgroup <th.pubnet.it-stds@thaigate.r.nii.ac.jp>, Trin Tansetthi <trin@mozart.inet.co.th>, Unicode Public List <unicode@unicode.org>, Virach Sornlertlamvanich <virach@nectec.or.th>
References: <OF96DCF778.1396EDF8-ON86256BCD.0063F8F3@sil.org>

Peter_Constable@sil.org wrote:
> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.

Let's talk about some things that really happen in Thai.

1)

0E01;THAI CHARACTER KO KAI;Lo;0
0E38;THAI CHARACTER SARA U;Mn;103
0E4D;THAI CHARACTER NIKHAHIT;Mn;0

The sequences (which happen in Pali transcription)

(a) KO KAI + SARA U + NIKHAHIT
(b) KO KAI + NIKHAHIT + SARA U

They look the same but do not compare equal, because the combining class
of NIKHAHIT happens to be 0, so both are already in normalized form.
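
This can be checked with Python's standard unicodedata module. The sketch
below is only an illustration; the variable and function names are chosen
here and are not part of any standard.

```python
import unicodedata

KO_KAI, SARA_U, NIKHAHIT = "\u0e01", "\u0e38", "\u0e4d"

a = KO_KAI + SARA_U + NIKHAHIT
b = KO_KAI + NIKHAHIT + SARA_U

# Print the canonical combining class of each character involved.
for ch in (KO_KAI, SARA_U, NIKHAHIT):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")


def equal_under(form: str, x: str, y: str) -> bool:
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)


# Collect the comparison result for each of the four normalization forms.
results = {form: equal_under(form, a, b) for form in ("NFD", "NFC", "NFKD", "NFKC")}
print(results)
```

Because NIKHAHIT has class 0, canonical reordering never moves SARA U across
it, so the comparison fails in every form.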

2)

0E32;THAI CHARACTER SARA AA;Lo;0
0E48;THAI CHARACTER MAI EK;Mn;107
0E33;THAI CHARACTER SARA AM;Lo;0;L;<compat> "NIKHAHIT" "SARA AA"

There are two ways to represent the word KO KAI + MAI EK + SARA AM

(a) KO KAI + MAI EK + SARA AM
(b) KO KAI + NIKHAHIT + MAI EK + SARA AA

(b) must be in this sequence to get the intended look for
the word (not that this is the valid sequence for Thai/WTT).
That is, the mai-ek is on top of the nikhahit.

The problem is with the NFKD/NFKC of (a), which is

(c) KO KAI + MAI EK + NIKHAHIT + SARA AA

This will be rendered with the nikhahit on top of the mai-ek,
which is not the same as (a) and is not the intended look.
So this means that the string changes its shape after
normalization. Is this a violation of any principle?

The problem also comes from the fact that the combining class of
NIKHAHIT is 0, which makes reordering of (c) impossible.
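
The behaviour described for sequences (a), (b) and (c) can be reproduced with
Python's standard unicodedata module. This is only a small check; the
variable names are chosen here for illustration.

```python
import unicodedata

KO_KAI, MAI_EK, SARA_AM = "\u0e01", "\u0e48", "\u0e33"
NIKHAHIT, SARA_AA = "\u0e4d", "\u0e32"

a = KO_KAI + MAI_EK + SARA_AM
b = KO_KAI + NIKHAHIT + MAI_EK + SARA_AA
c = KO_KAI + MAI_EK + NIKHAHIT + SARA_AA

# Compatibility normalization expands SARA AM into NIKHAHIT + SARA AA
# in place, leaving MAI EK in front of the NIKHAHIT.
nfkd_a = unicodedata.normalize("NFKD", a)
nfkc_a = unicodedata.normalize("NFKC", a)

print([f"U+{ord(ch):04X}" for ch in nfkd_a])
print(nfkd_a == c, nfkd_a == b)
```

Both NFKD and NFKC of (a) give (c), and neither form ever produces (b).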

-------- Original Message --------
Subject: Re: Fixed position combining classes
Date: Fri, 7 Jun 2002 11:44:42 +0700
From: Martin_Hosken@sil.org
To: Peter_Constable@sil.org, samphan@thai.com
CC: Arthit Suriyawongkul <Arthit.Suriyawongkul@sun.com>, Suwit Srivilairith <suwits@th.ibm.com>, Thai IT Standards Newsgroup <th.pubnet.it-stds@thaigate.r.nii.ac.jp>, Trin Tansetthi <trin@mozart.inet.co.th>, Unicode Public List <unicode@unicode.org>, Virach Sornlertlamvanich <virach@nectec.or.th>

Dear Peter,

Thanks for forwarding this. As you know, I don't monitor the Unicode list
otherwise I would probably never get *any* work done :(

Please include the sender's address in all replies (if any replies just go
to the lists, I won't see them :( )

>(a) KO KAI + SARA U + NIKHAHIT
>(b) KO KAI + NIKHAHIT + SARA U
>They look the same but do not compare equal, because the combining class
>of NIKHAHIT happens to be 0, so both are already in normalized form.

Yes. This is part of the combining class chaos. I don't even know if the
solution lies in the current model. My idea of what is needed is that we
want to end up with a Thai sequence of Base + LDias* + UDias*, and within
the LDias* and UDias*, order is significant, such that mai ek + sara ii is
different from sara ii + mai ek. The reason for going with this approach is
that then rendering can make misspellings obvious.

This can be achieved by making all lower diacritics of class, say, 103 and
upper diacritics of, say, 107.

<samphan>
(a) KO KAI + MAI EK + SARA AM
(b) KO KAI + NIKHAHIT + MAI EK + SARA AA

The problem is with the NFKD/NFKC of (a), which is

(c) KO KAI + MAI EK + NIKHAHIT + SARA AA
</samphan>

This is also a good point. The decomposition does not take into account the
necessary re-ordering that needs to happen to insert the mai ek between the
nikhahit and sara aa. Personally, I would suggest that the easiest approach
is to not list the compatibility decomposition, thus (a) does not
"dekompose" to (c). This just leaves the issue of (b) which should be
classed as a misspelling because there is no model that will enable (a) and
(b) to compare as equal strings (which is what the overall issue is about).

The problem is how to make it obvious that (b) is a misspelling. Beyond
moving nikhahit slightly to the left on rendering, I don't have an answer
to this one. How would you want such a misspelling to be displayed? Putting
a dotted circle (a common approach) before the sara aa doesn't seem to cut
it. Just sliding the nikhahit left, into its normal nikhahit position (as
opposed to where it goes for sara am) doesn't shout loud enough. Any
thoughts?

One approach may be to look at MS's implementation of Thai in Uniscribe.
There are 6 classes:

A1: U+0E31, U+0E34..U+0E37
A2: U+0E47, U+0E4D
A3: U+0E48..U+0E4B
A4: U+0E4C, U+0E4E
B1: U+0E38, U+0E39
B2: U+0E3A

The shaping rules state that a correct sequence (as opposed to inserting a
dotted circle marking an error) is:

      Base + B1? + B2? + A1? + A2? + A3? + A4?
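
As a rough sketch, the ordering rule above can be written as a regular
expression over the six classes listed in this message. The Python names
below are hypothetical, and this checks only the order of the marks that
follow a base; it is not a reimplementation of Uniscribe.

```python
import re

# Mark classes as listed above, written as regex character-class bodies.
A1 = "\u0e31\u0e34-\u0e37"
A2 = "\u0e47\u0e4d"
A3 = "\u0e48-\u0e4b"
A4 = "\u0e4c\u0e4e"
B1 = "\u0e38\u0e39"
B2 = "\u0e3a"

# B1? B2? A1? A2? A3? A4?  (at most one mark per class, in this order)
MARK_ORDER = re.compile("".join(f"[{cls}]?" for cls in (B1, B2, A1, A2, A3, A4)))


def is_well_ordered(marks: str) -> bool:
    """Return True if the marks following a base obey the order above."""
    return MARK_ORDER.fullmatch(marks) is not None


print(is_well_ordered("\u0e38\u0e4d"))  # SARA U + NIKHAHIT
print(is_well_ordered("\u0e4d\u0e38"))  # NIKHAHIT + SARA U
```

Under this rule SARA U + NIKHAHIT is accepted and NIKHAHIT + SARA U is
rejected, while NIKHAHIT + MAI EK is accepted and MAI EK + NIKHAHIT is not.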

I.e. if you get things in the wrong order then you have problems. This
raises two issues:

1. Is the model correct? Is it wide enough to support the needs of
minority languages (both present and future)? Of course, future predictions
are impossible.
2. Can we use this as a basis for arriving at combining classes for Thai?
E.g.

A1 -> 107
A2 -> 108
A3 -> 109
A4 -> 110
B1 -> 103
B2 -> 104
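
To see what such a mapping would do, here is a minimal Python sketch of
canonical reordering driven by the class values proposed above. These values
are hypothetical and are NOT the combining classes in the Unicode Character
Database; the table and function names are made up for illustration.

```python
# Hypothetical combining classes from the mapping above (not real UCD data).
HYPOTHETICAL_CCC = {}
for code_points, ccc in (
    ((0x0E31, *range(0x0E34, 0x0E38)), 107),  # A1
    ((0x0E47, 0x0E4D), 108),                  # A2
    (range(0x0E48, 0x0E4C), 109),             # A3
    ((0x0E4C, 0x0E4E), 110),                  # A4
    ((0x0E38, 0x0E39), 103),                  # B1
    ((0x0E3A,), 104),                         # B2
):
    for cp in code_points:
        HYPOTHETICAL_CCC[chr(cp)] = ccc


def canonical_reorder(text: str) -> str:
    """Stable-sort each run of non-zero-class characters by class."""
    out, run = [], []
    for ch in text:
        if HYPOTHETICAL_CCC.get(ch, 0) == 0:
            # A class-0 character ends the current run of marks.
            out.extend(sorted(run, key=HYPOTHETICAL_CCC.get))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=HYPOTHETICAL_CCC.get))
    return "".join(out)


seq_1a = "\u0e01\u0e38\u0e4d"        # KO KAI + SARA U + NIKHAHIT
seq_1b = "\u0e01\u0e4d\u0e38"        # KO KAI + NIKHAHIT + SARA U
seq_2b = "\u0e01\u0e4d\u0e48\u0e32"  # KO KAI + NIKHAHIT + MAI EK + SARA AA
seq_2c = "\u0e01\u0e48\u0e4d\u0e32"  # KO KAI + MAI EK + NIKHAHIT + SARA AA
print(canonical_reorder(seq_1a) == canonical_reorder(seq_1b))
print(canonical_reorder(seq_2c) == seq_2b)
```

With these assumed classes, the two SARA U / NIKHAHIT orderings reorder to
the same string, and sequence (c) from Samphan's example reorders into (b).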

Of course this is all a little academic, since the rules for changing Unicode
mean that, because the combining class is a normative property, we can't
just ask for the combining classes to be changed. But that is part of a
wider issue of Unicode and changing mistakes.

Yours,
Martin

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html


