Re: Fixed position combining classes (Was: Combining class for Thai characters)

From: Peter_Constable@sil.org
Date: Mon Jun 03 2002 - 14:25:40 EDT


On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:

>> My opinion is that they should have been simplified, but that setting the
>> bulk of them to 0 was a mistake and creates some significant problems
>> (which go a step beyond the questions you raise here).
>
>Can you elaborate on this?

Given the characters

: 0E35;THAI CHARACTER SARA II;Mn;0
: 0E39;THAI CHARACTER SARA UU;Mn;103

consider the sequences

< 0e35, 0e39 > vs. < 0e39, 0e35 >

I'm guessing your first reaction will be to say that these cannot co-occur.
That is true for the Thai language, but may not be true for other languages
written with Thai script.

Now, the problem with the sequences above is that they are visually
indistinguishable, meaning that users could not possibly use them to make a
semantically relevant distinction. From the user's perspective, they are
identical. Moreover, users would expect string comparisons to equate them
(e.g. a search for < 0e35, 0e39 > should find a match if the data contains
< 0e39, 0e35 >). Both are canonically ordered sequences, however, since
U+0E35 has a combining class of 0, and canonical reordering never moves a
mark across a character of class 0. The result is that string comparisons
that rely on normalisation into any one of the existing Unicode
normalisation forms (NFD, NFC, NFKD, NFKC) will fail to consider these as
equal.
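
As a rough illustration of the point above, here is a minimal Python sketch
using the standard unicodedata module. The base consonant U+0E01 and the
helper name equal_under are my own choices for the example, not part of the
original discussion, and the combining classes reported depend on the
Unicode data bundled with the Python build in use.

```python
import unicodedata

KO_KAI = "\u0e01"   # base consonant, picked arbitrarily for the example
SARA_II = "\u0e35"  # THAI CHARACTER SARA II, combining class 0
SARA_UU = "\u0e39"  # THAI CHARACTER SARA UU, combining class 103

seq_a = KO_KAI + SARA_II + SARA_UU   # < 0e35, 0e39 > on a base
seq_b = KO_KAI + SARA_UU + SARA_II   # < 0e39, 0e35 > on the same base


def equal_under(form, s, t):
    """Compare two strings after normalising both to the given form."""
    return unicodedata.normalize(form, s) == unicodedata.normalize(form, t)


# Combining classes as seen by this Python build.
classes = {hex(ord(c)): unicodedata.combining(c) for c in (SARA_II, SARA_UU)}

# Normalised comparison under each of the four forms.
results = {form: equal_under(form, seq_a, seq_b)
           for form in ("NFD", "NFC", "NFKD", "NFKC")}

print(classes)
print(results)

# For contrast: two Latin marks with distinct non-zero classes
# (U+0323 is class 220, U+0301 is class 230) do get reordered,
# so both orders normalise to the same string.
latin_equal = equal_under("NFD", "a\u0301\u0323", "a\u0323\u0301")
print(latin_equal)
```

A comparison that should treat the two Thai sequences as equal would need
some extra step beyond normalisation; which step is appropriate is a
separate design question.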

>IMO, it would be best if we could change that. But apart from that, it
>would still be useful to note what is right or wrong rather than say
>nothing about it. After all, this happens to other (Indic) scripts too,
>right?

There are some similar problems in at least Lao, Khmer and Myanmar. I don't
recall for certain, but there may also be similar problems in Hebrew. In
some cases, the problem is one of having canonical ordering that
distinguishes pairs that are visually non-distinct, while in other cases
it's the opposite: canonical ordering fails to distinguish different
sequences where more than one ordering -- corresponding to a visual
distinction -- is needed. My recollection with regard to Hebrew is that
the latter occurs (there being some cases in which Biblical Hebrew texts
have more than one accent on a consonant).
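
The mechanism behind the second kind of problem can be sketched in the same
way. The pair of Hebrew marks below is chosen only because the two have
different non-zero combining classes; I am not claiming it is one of the
Biblical Hebrew cases referred to above, only that it shows how
normalisation erases whatever order was typed.

```python
import unicodedata

BET = "\u05d1"     # HEBREW LETTER BET, base character
PATAH = "\u05b7"   # HEBREW POINT PATAH, combining class 17
DAGESH = "\u05bc"  # HEBREW POINT DAGESH OR MAPIQ, combining class 21

typed_a = BET + DAGESH + PATAH
typed_b = BET + PATAH + DAGESH

# Canonical ordering sorts the marks by combining class, so both inputs
# end up in the same order after normalisation.
nfd_a = unicodedata.normalize("NFD", typed_a)
nfd_b = unicodedata.normalize("NFD", typed_b)

print([hex(ord(c)) for c in nfd_a])
print(nfd_a == nfd_b)
```

If the order of two such marks ever corresponded to a visual difference,
that difference could not survive normalisation.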

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


