Re: Fixed position combining classes (Was: Combining class for Thai characters)

From: Peter_Constable@sil.org
Date: Tue Jun 04 2002 - 06:13:09 EDT


On 06/03/2002 05:56:38 PM Kenneth Whistler wrote:

>Peter,

>The problem, of course, is that not all eventualities could be
>foreseen at the time the decisions had to be made -- when normalization
>and Unicode 3.0 were looming...

>So hindsight is 20/20. But at the time, the editors and participants
>in the UTC couldn't get experts to pay enough attention to the
>potential implications for Thai and other Southeast Asian scripts,
>so now we are stuck with a few anomalies that people will just have
>to program around, I am afraid.

I understand. I'm not arguing at this point that the combining classes
should be changed (though I would were it a possibility) -- if you look at
my earlier post in this thread, you'll see that I explained to Khun Samphan
that this is not a possibility. At this point, I'm merely explaining *why*
the combining classes as they stand present issues for implementers.

>> The result is that string comparisons that rely on normalisation into
>> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD,
>> NFKC) will fail to consider these as equal.
>
>I think you are missing a point here. It is true that if you just
>take the two strings, normalize them, and then compare binary, they
>will compare unequal. But for most users' expectations of equivalent
>string comparisons, simply comparing binary for normalized strings
>is insufficient, anyway. There may be embedded (invisible) format
>control characters (ZWJ and its ilk) which should be ignored on
>comparison -- but a simple binary compare won't do that.
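
To make the point about format controls concrete, here is a minimal Python sketch of a comparison that ignores them before normalising. The helper names are hypothetical, and dropping every character of general category Cf is an assumption made for illustration; a real implementation would likely be more selective about which controls to ignore.

```python
import unicodedata


def strip_format_controls(text: str) -> str:
    """Drop characters of general category Cf (ZWJ, ZWNJ, and similar).

    Assumption: every Cf character is ignorable for comparison purposes.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings after removing format controls and applying NFD."""
    def prepare(s: str) -> str:
        return unicodedata.normalize("NFD", strip_format_controls(s))
    return prepare(a) == prepare(b)


def binary_equal_after_nfd(a: str, b: str) -> bool:
    """The simple approach: normalise, then compare code point by code point."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


plain = "\u0e01\u0e32"            # THAI KO KAI, SARA AA
with_zwj = "\u0e01\u200d\u0e32"   # same letters with an embedded ZERO WIDTH JOINER

print(binary_equal_after_nfd(plain, with_zwj))
print(loosely_equal(plain, with_zwj))
```

Normalisation leaves the joiner in place, so the simple comparison distinguishes the two strings, while the enhanced one does not.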

True, but I think there's a categorical difference between the need to
remove ZWJ and its ilk and the other kinds of issues you raise on the one
hand, and on the other, the issues I've raised in relation to combining
classes for SE Asian scripts and Hebrew: the former are things that
implementers have been aware of for a while, but the latter is something
they are likely not aware of, and is exactly the kind of thing people would
have expected normalisation to have dealt with and so are not likely to
notice. Implementers need to have the issues pointed out to them, which is
exactly my intent -- for at least one potential implementer -- with the
comments I have made in this thread.
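
For implementers who want to see the mechanism, the following Python sketch uses the standard `unicodedata` module to look at a few Thai marks. The sequences are illustrative assumptions chosen for this sketch, not necessarily the ones cited earlier in the thread. It shows that two marks with different non-zero classes are put into a single order by normalisation, while a mark with class 0 blocks reordering, so the two orders remain distinct.

```python
import unicodedata

KO_KAI = "\u0e01"   # base consonant
SARA_I = "\u0e34"   # vowel sign written above the base
SARA_U = "\u0e38"   # vowel sign written below the base
MAI_EK = "\u0e48"   # tone mark written above the base


def ccc(ch: str) -> int:
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)


def nfd_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFD normalisation."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


for name, ch in [("SARA I", SARA_I), ("SARA U", SARA_U), ("MAI EK", MAI_EK)]:
    print(name, ccc(ch))

# Different non-zero classes: canonical reordering yields one order.
below_then_tone = KO_KAI + SARA_U + MAI_EK
tone_then_below = KO_KAI + MAI_EK + SARA_U
print(nfd_equal(below_then_tone, tone_then_below))

# One mark has class 0: no reordering takes place across it.
vowel_then_tone = KO_KAI + SARA_I + MAI_EK
tone_then_vowel = KO_KAI + MAI_EK + SARA_I
print(nfd_equal(vowel_then_tone, tone_then_vowel))
```

Whether the second pair ought to compare equal is a separate question; the sketch only shows that normalisation by itself does not settle it.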

>The ordinary way to deal with this is to enhance the comparisons,
>often in language-specific ways, to match user expectations of what
>should and should not compare equal under various circumstances.

Is that true everywhere? What about systems for file naming, security,
domain naming, etc. for which language-specific processing is rarely if
ever done? Even in word processors, I doubt that language-tailored
collation-based comparisons are used.

But clearly if the combining classes can't be changed, then some or all of
these will have to start dealing at least with the issues that these
combining class values raise. At least, given all the hoopla in recent
months about spoofing and security, I'd think people with concerns in this
area would want to deal with the issues presented by these combining class
values.

And if my memory is serving me in relation to Hebrew, we're also going to
have to look at that again and figure out a way to encode needed
distinctions that the fixed position classes cause to be neutralised in
normalisation.
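
As a rough illustration of that neutralisation, here is a small Python sketch using the standard `unicodedata` module. The choice of patah and hiriq on a lamed is an assumption made for this example: the two points have different fixed position classes, so every normalisation form puts them into the same order regardless of how they were typed.

```python
import unicodedata

LAMED = "\u05dc"   # HEBREW LETTER LAMED
PATAH = "\u05b7"   # HEBREW POINT PATAH
HIRIQ = "\u05b4"   # HEBREW POINT HIRIQ

patah_first = LAMED + PATAH + HIRIQ
hiriq_first = LAMED + HIRIQ + PATAH

print(unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))

# The two input orders are distinct code point sequences...
print(patah_first == hiriq_first)

# ...but each normalisation form maps them to one and the same sequence.
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    same = (unicodedata.normalize(form, patah_first)
            == unicodedata.normalize(form, hiriq_first))
    print(form, same)
```

Any distinction that an author intended to carry by the order of the two points is therefore lost once the text has been normalised.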

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


