Re: Case blind comparison

From: Gary Roberts (gar@sparc.SanDiegoCA.NCR.COM)
Date: Wed Jul 30 1997 - 19:23:02 EDT


> From: Mark Davis <mark_davis@taligent.com>
>
> If you really only have the routines toUpper() and toLower(), and you
> are trying to do a caseless comparison, then you have to use:
>
> normalForm = toUpper(toLower(source)); // or toLower(toUpper(source));
>
> This takes account of all of the characters that have many-to-one case
> mappings, whether they be uppercase or lowercase (it is not just limited
> to es-zed).
>
> You can also use the information in the Unicode character database to
> generate a more efficient version of the above method.
>
> Mark

Actually, I am using unidata2.txt to construct a caseless comparison. From
my examination of these tables, the only situations where

toUpper(source) does not equal toUpper(toLower(source))

are the Turkish capital letter I dot and Georgian. The proposal
I suggested (mapping Georgian to lower case, and `I dot' to `I') is
exactly equivalent to toUpper(toLower(source)), although I did not
think of it in that fashion. So it turns out that we agree, although
this was not obvious at first.
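
As an illustration, here is a minimal sketch of the toUpper(toLower(source))
normal form in Python. It assumes Python's built-in str.lower() and
str.upper() as stand-ins for toLower() and toUpper(); these use whatever
Unicode data ships with the interpreter, not unidata2.txt, so the set of
exceptional characters it reports may well differ from the one described
above. The names caseless_key, caseless_equal and upper_differs are
hypothetical.

```python
def caseless_key(source: str) -> str:
    """Normal form for caseless comparison: toUpper(toLower(source))."""
    return source.lower().upper()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their caseless normal forms."""
    return caseless_key(a) == caseless_key(b)


def upper_differs():
    """Yield BMP characters c where toUpper(c) != toUpper(toLower(c))."""
    for cp in range(0x10000):
        c = chr(cp)
        if c.upper() != c.lower().upper():
            yield c


if __name__ == "__main__":
    # KELVIN SIGN (U+212A) has no uppercase mapping of its own, so toUpper
    # alone leaves it distinct from "K"; lowercasing first brings it to "k".
    print(caseless_equal("\u212A", "K"))
    # List the characters for which toUpper alone is not enough.
    print(" ".join(f"U+{ord(c):04X}" for c in upper_differs()))
```

The scan is the brute-force version of "examining the tables": any
character it reports needs either the double mapping or an explicit
exception entry.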

Ken and Kent bring up certain canonical equivalences, which the
technique I proposed will not handle.

Ken further mentions:

> I think you need to distinguish between:
>
> 1. exact binary matching ( a-acute != a + combining acute )
> 2. exact match on canonical equivalence ( a-acute = a + combining acute)
> 3. case sensitive match on equivalence classes for particular collation
> (where, for example, s = long s = modifier letter small s)

Again, I think I agree that these are desirable (particularly 2
and 3). I also have a need for case blind versions of these three. In
those terms, I was trying to define a case blind version of 1. I can
see there is some ambiguity in this task.
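
To make the distinction concrete, here is a rough sketch of levels 1 and 2
and their case blind counterparts in Python. It assumes the standard
unicodedata module for canonical decomposition (NFD) and reuses the
toUpper(toLower()) idea for case; all function names are hypothetical.
Level 3 needs collation tables and is not attempted here.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Level 1: exact code point match."""
    return a == b


def canonical_equal(a: str, b: str) -> bool:
    """Level 2: match after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def _caseless(s: str) -> str:
    # toUpper(toLower(s)), using the interpreter's own case mappings.
    return s.lower().upper()


def caseless_binary_equal(a: str, b: str) -> bool:
    """Case blind version of level 1."""
    return _caseless(a) == _caseless(b)


def caseless_canonical_equal(a: str, b: str) -> bool:
    """Case blind version of level 2."""
    def key(s: str) -> str:
        # Decompose, fold case, then decompose again in case the case
        # mapping produced a precomposed character.
        folded = _caseless(unicodedata.normalize("NFD", s))
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


if __name__ == "__main__":
    composed, decomposed = "\u00E1", "a\u0301"  # a-acute, a + combining acute
    print(binary_equal(composed, decomposed))
    print(canonical_equal(composed, decomposed))
    print(caseless_binary_equal("\u00C1", decomposed))
    print(caseless_canonical_equal("\u00C1", decomposed))
```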

I am now tempted to include the following mappings:

U+0340 -> U+0300
U+0341 -> U+0301
U+0343 -> U+0313
U+0374 -> U+02B9
U+037E -> U+003B
U+0387 -> U+00B7
U+04D4 -> U+00C6
U+04D5 -> U+00C6
U+04D8 -> U+018F
U+04D9 -> U+018F
U+04E0 -> U+01B7
U+04E1 -> U+01B7
U+1FEF -> U+0060
U+1FFD -> U+00B4
U+2000 -> U+2002
U+2001 -> U+2003
U+2126 -> U+03A9
U+212A -> U+004B
U+212B -> U+00C5
U+2329 -> U+3008
U+232A -> U+3009

but I am clearly on a slippery slope.
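
For illustration, a table like this could be applied as a pre-pass before
the case mapping. The sketch below is Python and copies the pairs exactly
as listed above; they have not been checked against any other version of
the character database. The names EXTRA_FOLDS, fold_extras and
caseless_key are hypothetical.

```python
# Extra one-to-one folds, copied verbatim from the list in this message.
EXTRA_FOLDS = {
    0x0340: 0x0300, 0x0341: 0x0301, 0x0343: 0x0313,
    0x0374: 0x02B9, 0x037E: 0x003B, 0x0387: 0x00B7,
    0x04D4: 0x00C6, 0x04D5: 0x00C6,
    0x04D8: 0x018F, 0x04D9: 0x018F,
    0x04E0: 0x01B7, 0x04E1: 0x01B7,
    0x1FEF: 0x0060, 0x1FFD: 0x00B4,
    0x2000: 0x2002, 0x2001: 0x2003,
    0x2126: 0x03A9, 0x212A: 0x004B, 0x212B: 0x00C5,
    0x2329: 0x3008, 0x232A: 0x3009,
}


def fold_extras(source: str) -> str:
    """Replace each listed code point by its target; leave the rest alone."""
    return source.translate(EXTRA_FOLDS)


def caseless_key(source: str) -> str:
    """Apply the extra folds, then toUpper(toLower())."""
    return fold_extras(source).lower().upper()


if __name__ == "__main__":
    # ANGSTROM SIGN against LATIN SMALL LETTER A WITH RING ABOVE
    print(caseless_key("\u212B") == caseless_key("\u00E5"))
```

Every row added to the table is another equivalence the comparison must
honour consistently, which is where the slope gets slippery.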

Just for starters, I came across the following:

U+04E8 = U+019F (where = means is canonically equal to)
U+04E9 = U+0275

U+04E9 uppercases to U+04E8,
U+0275 does not uppercase to U+019F

Is this a bug? I suspect not, but if it is not a bug, I am at a loss as
to what to do, because it implies:

U+019F == U+04E8 == U+04E9 (where == means compares equal to)
U+0275 == U+04E9
but U+019F != U+0275 (where != means compares not equal to)

I don't have an easy way to accomplish this, nor am I convinced it
is desirable.
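
The difficulty can be shown mechanically. Any comparison defined as
"equal normal forms" is transitive, so once the two canonical
equivalences and the one case pair are all honoured, U+019F and U+0275
end up in the same class whether we want that or not. A small sketch in
Python, using only the relations stated in this message and a hypothetical
union-find helper named make_classes:

```python
def make_classes(pairs):
    """Build equivalence classes from pairs; return a find(x) function."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return find


# Relations as stated in the message, not taken from any other data file.
RELATIONS = [
    (0x04E8, 0x019F),  # canonical equivalence
    (0x04E9, 0x0275),  # canonical equivalence
    (0x04E9, 0x04E8),  # case pair: U+04E9 uppercases to U+04E8
]

find = make_classes(RELATIONS)

if __name__ == "__main__":
    # Transitivity forces the two Latin letters together.
    print(find(0x019F) == find(0x0275))  # prints True
```

So keeping U+019F != U+0275 would require a comparison that is not based
on a single normal form at all.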

I also notice

U+1FBE = U+0399

This looks a bit odd to me, and I would appreciate it if someone knowledgeable
on this matter could confirm this equivalence.
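
One way to check such an entry is to ask a character database directly.
The sketch below uses Python's standard unicodedata module; it reflects
whichever Unicode version the interpreter was built with, which is an
assumption on my part and need not agree with unidata2.txt. The function
name canonical_mapping is hypothetical.

```python
import unicodedata


def canonical_mapping(ch: str) -> list:
    """Return the canonical decomposition of ch as a list of code points.

    Returns [] when there is no decomposition or when the decomposition
    is a compatibility one (tagged with <...> in the database).
    """
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return []
    return [int(h, 16) for h in field.split()]


if __name__ == "__main__":
    for ch in ("\u1FBE", "\u212A", "\u212B"):
        targets = " ".join(f"U+{cp:04X}" for cp in canonical_mapping(ch))
        print(f"U+{ord(ch):04X} -> {targets or '(none)'}")
```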

The temptation of adding any of the above mappings is starting to leave me.
