Re: Fixed position combining classes (Was: Combining class for Thai characters)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jun 03 2002 - 18:56:38 EDT


Peter,

> On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:
>
> >> My opinion is that they should have been simplified, but that setting
> >> the bulk of them to 0 was a mistake and creates some significant problems
> >> (which go a step beyond the questions you raise here).
> >
> >Can you elaborate on this?
>
> Given the characters
>
> : 0E35;THAI CHARACTER SARA II;Mn;0
> : 0E39;THAI CHARACTER SARA UU;Mn;103
>
> consider the sequences
>
> < 0e35, 0e39 > vs. < 0e39, 0e35 >
>
> I'm guessing your first reaction will be to say that these cannot co-occur.
> That is true for the Thai language, but may not be true for other languages
> written with Thai script.

The problem, of course, is that not all eventualities could be
foreseen at the time the decisions had to be made -- when normalization
and Unicode 3.0 were looming. It might have been possible to marginally
improve on the assignments that eventually were made -- but both the
original assignment to fixed position classes, and the later simplification
of the fixed position classes, had to be made *prior* to the accumulation
of experience based on normalization being locked down in the standard.

So hindsight is 20/20. But at the time, the editors and participants
in the UTC couldn't get experts to pay enough attention to the
potential implications for Thai and other Southeast Asian scripts,
so now we are stuck with a few anomalies that people will just have
to program around, I am afraid.

>
> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.

I think you are missing a point here. It is true that if you just
take the two strings, normalize them, and then compare binary, they
will compare unequal. But for most users' expectations of equivalent
string comparisons, simply comparing binary for normalized strings
is insufficient, anyway. There may be embedded (invisible) format
control characters (ZWJ and its ilk) which should be ignored on
comparison -- but a simple binary compare won't do that. The presence
of a ZWSP might or might not be considered as indicative of a string
difference by a user, but would definitely cause the strings to compare
unequal without a corresponding visual difference. On the other hand, the
presence of some types of visual punctuation might be considered insignificant
by a user, and to be ignored, even though causing a visual difference.
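
To make the point concrete, here is a minimal sketch using Python's
standard unicodedata module. The choice of Python, the base consonant
U+0E01, and the helper names are assumptions made for illustration; they
are not part of the original exchange. It shows that normalizing and then
comparing binary distinguishes the two vowel orders, and that an embedded
ZWJ defeats a binary compare in the same way until format controls are
filtered out.

```python
import unicodedata

def nfd(s):
    """Normalize to NFD; canonical reordering happens here."""
    return unicodedata.normalize("NFD", s)

# Canonical combining classes as given in the quoted UnicodeData lines.
ccc_sara_ii = unicodedata.combining("\u0e35")  # 0
ccc_sara_uu = unicodedata.combining("\u0e39")  # 103

# The two orders discussed above, on an illustrative base consonant.
a = "\u0e01\u0e35\u0e39"
b = "\u0e01\u0e39\u0e35"

# U+0E35 has class 0, so it blocks reordering and both strings are
# already in canonical order. Binary comparison sees two strings.
same_after_nfd = nfd(a) == nfd(b)

def strip_format_controls(s):
    """Drop characters of general category Cf (ZWJ and similar)."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")

# An invisible ZWJ also makes normalized strings compare unequal.
c = "\u0e01\u200d\u0e35"
d = "\u0e01\u0e35"
same_with_zwj = nfd(c) == nfd(d)
same_without_zwj = strip_format_controls(nfd(c)) == strip_format_controls(nfd(d))
```

Whether a given format control should be ignored is a policy decision for
the application; the filter here is only one possible choice.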

The ordinary way to deal with this is to enhance the comparisons,
often in language-specific ways, to match user expectations of what
should and should not compare equal under various circumstances. And
a commonly used technology for that is one form or another of collation
tailoring for culturally expected string comparison. If such technology
is being used to provide better results, there is no particular reason
why the language-specific tailorings for it cannot also take into account
the few anomalous cases resulting from canonical ordering of dependent
vowels in Brahmi-derived scripts in Southeast Asia, so that, under those
circumstances, < 0e35, 0e39 > vs. < 0e39, 0e35 > *would* compare equal.
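
A real implementation would normally express this as a tailoring in a
collation library. As a rough stand-in, the sketch below only illustrates
the idea with a matching key built in Python. The table VIRTUAL_CCC, the
value 104, and the function names are hypothetical: the table assigns
made-up "virtual" combining classes to a few Thai above-base vowel signs
whose real class is 0, so that marks within one cluster are put into a
fixed order before comparison.

```python
import unicodedata

# Hypothetical virtual classes for Thai above-base vowel signs whose real
# canonical combining class is 0. The value 104 is an arbitrary choice
# that sorts after the below vowels (103) and before the tone marks (107).
VIRTUAL_CCC = {cp: 104 for cp in (0x0E31, 0x0E34, 0x0E35, 0x0E36, 0x0E37)}

def effective_ccc(ch):
    """Real combining class if non-zero, else the virtual one (or 0)."""
    return unicodedata.combining(ch) or VIRTUAL_CCC.get(ord(ch), 0)

def match_key(s):
    """Return a key in which runs of marks are in a fixed order.

    This is for matching only; it does not produce a Unicode
    normalization form and must not be stored as normalized text.
    """
    out, run = [], []
    for ch in unicodedata.normalize("NFD", s):
        if effective_ccc(ch):
            run.append(ch)
        else:
            # sorted() is stable, so marks of equal class keep their order.
            out.extend(sorted(run, key=effective_ccc))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=effective_ccc))
    return "".join(out)

a = "\u0e01\u0e35\u0e39"
b = "\u0e01\u0e39\u0e35"
keys_match = match_key(a) == match_key(b)
```

With this key the two orders match each other, while strings that differ
in which vowels are present still compare unequal.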

>
>
> >IMO, it would be best if we could change that. But apart from that, it
> >would still be more useful to note what is right or wrong than to say
> >nothing about it. After all, this happens to other (Indic) scripts too,
> >right?
>
> There are some similar problems in at least Lao, Khmer and Myanmar. I don't
> recall for certain, but there may also be similar problems in Hebrew.

And each of the cases is fairly limited and amenable to the same
kinds of solutions, script by script, and language by language.

In any case, I think one is going to have to have some rather
specific string comparison extensions to get Khmer and Myanmar
string orderings and matchings to behave appropriately. And people
who need to make those extensions aren't going to be particularly
misled by the few anomalous instances of above or below vowel
signs having zero combining classes, which make it technically
possible to have non-canonically equivalent spellings of visually
similar combinations.

--Ken


