RE: UCD 3.1, Final Beta - Case folding

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Mar 06 2001 - 11:27:20 EST

Next message: Marco Cimarosti: "RE: Romanche dash"
Previous message: Antoine Leca: "Re: UCD 3.1, Final Beta - Case folding"
Maybe in reply to: Carl W. Brown: "UCD 3.1, Final Beta - Case folding"
Next in thread: Carl W. Brown: "RE: UCD 3.1, Final Beta - Case folding"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Antoine,

Case folding is very useful for Turkish. For example, "Istanbul" is spelled
with an uppercase I WITH DOT ABOVE (U+0130) in Turkish. With case folding,
both versions are converted to "istanbul" for matching purposes.

Case folding also converts the Greek beta symbol to a small letter beta.

In essence, case folding is the equivalent of a shift to upper case followed
by a shift to lower case.

The I shifts are

To upper:

0049 -> 0049
0069 -> 0049
0130 -> 0130
0131 -> 0049

To lower:

0049 -> 0069
0130 -> 0069

The only real difference is that all sigmas become the non-final sigma. There
is no need for the final-sigma adjustment, since the text is for comparison
purposes only.
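
The I mappings listed above can be composed in a few lines of Python. This is
only a sketch: the two tables hold just the entries quoted in this message,
and the name fold_i is made up for illustration. Code points missing from a
table are assumed to map to themselves.

```python
# Upper- and lower-case entries for the I family, as listed in this message.
# Code points that are absent from a table are assumed to map to themselves.
TO_UPPER = {0x0049: 0x0049, 0x0069: 0x0049, 0x0130: 0x0130, 0x0131: 0x0049}
TO_LOWER = {0x0049: 0x0069, 0x0130: 0x0069}


def fold_i(cp):
    """Shift a code point to upper case, then shift the result to lower case."""
    upper = TO_UPPER.get(cp, cp)
    return TO_LOWER.get(upper, upper)


for cp in (0x0049, 0x0069, 0x0130, 0x0131):
    print(f"{cp:04X} -> {fold_i(cp):04X}")
```

With these tables, all four code points end up at 0069.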

What I am suggesting is that removing the COMBINING DOT ABOVE after any i
will produce a better matching string. I can find no instance where
dropping it will cause false matches. Not dropping it will produce false
mismatches.
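
A rough Python sketch of the suggestion, under some assumptions: the fold
table contains only the three I entries discussed in this thread, every other
character falls back to str.lower() as a stand-in for the real folding data,
and the dot is dropped only when it immediately follows the folded i. The
names I_FOLD, fold and fold_dropping_dot are made up for illustration.

```python
COMBINING_DOT_ABOVE = "\u0307"

# Only the I-related folding entries discussed in this thread.
I_FOLD = {"\u0049": "\u0069", "\u0130": "\u0069", "\u0131": "\u0069"}


def fold(text):
    """Fold the I family via I_FOLD; other characters use str.lower() (assumption)."""
    return "".join(I_FOLD.get(ch) or ch.lower() for ch in text)


def fold_dropping_dot(text):
    """Fold, then drop a COMBINING DOT ABOVE that directly follows an 'i'."""
    out = []
    for ch in fold(text):
        if ch == COMBINING_DOT_ABOVE and out and out[-1] == "i":
            continue  # the proposed rule: remove the dot after any i
        out.append(ch)
    return "".join(out)


composed = "\u0130stanbul"      # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = "I\u0307stanbul"   # I followed by COMBINING DOT ABOVE

print(fold(composed) == fold(decomposed))
print(fold_dropping_dot(composed) == fold_dropping_dot(decomposed))
```

Without the extra rule the decomposed spelling keeps its U+0307 and fails to
match the composed one. With the rule, both fold to "istanbul".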

Carl

-----Original Message-----
From: Carl W. Brown [mailto:cbrown@xnetinc.com]
Sent: Monday, March 05, 2001 11:19 AM
To: Unicode List
Subject: RE: UCD 3.1, Final Beta - Case folding

-----Original Message-----
From: Antoine Leca [mailto:Antoine.Leca@renault.fr]
Sent: Monday, March 05, 2001 9:57 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding

>Carl W. Brown wrote:
>>
>> I noticed that there is no mention of the casing special case:
>>
>> # Lithuanian
>>
>> 0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
>>
>> The case folding is locale-less so it seems to me that it is probably better
>> to remove the COMBINING DOT ABOVE after all 'i' / 'I' regardless of locale
>> to make it work for Lithuanian. I doubt that this will cause serious
>> problems with caseless compares for other locales.

>I think the 'I' above is a typo, isn't it? You meant 'j', don't you?

I do mean 'i' not 'j'.

>If not, please consider a Turkish text, fully decomposed: there, a dot_above
>U+0307 following an uppercase I U+0049 should certainly *not* be dropped.

This works for Turkish as well. Case folding folds dotted and dotless i
into 'i'.

0049; C; 0069; # LATIN CAPITAL LETTER I
0130; I; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0131; I; 0069; # LATIN SMALL LETTER DOTLESS I

By removing the COMBINING DOT ABOVE, the fully decomposed text will match
the composed text and therefore be a better representation of case folding.

>Antoine


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT