Re: Turkic casefolding rules

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 13 May 2012 22:51:08 +0200

Casefolding is just like lowercasing, except that it won't convert
uppercase letters to lowercase forms that would otherwise remain
distinct if they were left in uppercase.

Casefolding differs from lowercasing only on letters (or
digits/symbols) that are neither exactly bicameral (like A/a) nor
exactly monocameral (like Arabic letters and currency symbols),
because casefolding may drop the distinction between letters that
differ within the same case only by their contextual forms
(e.g. the Greek sigma, the traditional German Eszett).
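
As a rough illustration of the sigma and Eszett cases, here is a small
Python sketch. It relies only on the built-in str.lower() and
str.casefold(); the helper name same_caseless is made up for this
example and is not taken from any Unicode data file.

```python
def same_caseless(a: str, b: str) -> bool:
    """Compare two strings using Python's built-in full case folding."""
    return a.casefold() == b.casefold()


# Lowercasing leaves these letters alone; case folding merges them.
print("ß".lower(), "ß".casefold())   # ß ss
print("ς".lower(), "ς".casefold())   # ς σ

# Caseless matching therefore treats the two spellings as equal.
print(same_caseless("Straße", "STRASSE"))  # True
```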

The Turkic letters in question here are the I and J, which are
distinguished by the *hard* absence or presence of the dot above,
which is no longer a soft dot. But they still form pairs of
letters (with or without the dot) that keep their distinction even
when both are converted to lowercase, or both to uppercase.

What this means is that the Turkic rules require handling the ASCII
lowercase "i" or "j" as if they were hard-dotted, i.e. as if they were
canonically equivalent to the undotted letter with a combining dot
above encoded after it.

This is the kind of processing that you need to do prior to applying
the casefolding: you have to decompose the ASCII lowercase letters
"i" and "j" even though they are normally not decomposable and are
canonically equivalent only to themselves (when not using the Turkic
rules).
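
For the dotted/dotless I pair, a minimal Python sketch of "extra
processing before folding" could look like the following. Note the
assumptions: it composes to NFC first rather than decomposing "i" as
described above, which is a simpler substitute and only covers the
plain case of "I" directly followed by U+0307; the two mappings are
the ones marked "T" in CaseFolding.txt; the name turkic_casefold is
hypothetical.

```python
import unicodedata

# The two Turkic ("T") mappings listed in CaseFolding.txt.
TURKIC_FOLD = {
    "\u0049": "\u0131",  # LATIN CAPITAL LETTER I -> dotless i
    "\u0130": "\u0069",  # CAPITAL I WITH DOT ABOVE -> plain i
}


def turkic_casefold(text: str) -> str:
    """Case-fold text with the Turkic I mappings (simple cases only)."""
    # Assumption: composing first turns "I" + U+0307 into U+0130, so
    # both canonically equivalent spellings take the same mapping.
    composed = unicodedata.normalize("NFC", text)
    mapped = "".join(TURKIC_FOLD.get(ch, ch) for ch in composed)
    # Everything else goes through the default full case folding.
    return mapped.casefold()


# Both spellings of dotted capital I fold to the same string.
print(turkic_casefold("\u0130") == turkic_casefold("I\u0307"))  # True
```

Without the normalization step, "I" + U+0307 would fold to dotless i
followed by the combining dot, which is not canonically equivalent to
the "i" obtained from U+0130.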

2012/5/13 Karl Williamson <public_at_khwilliamson.com>:
> In CaseFolding.txt, it says the following:
>
> "Note that the Turkic mappings do not maintain canonical equivalence without
> additional processing.  See the discussions of case mapping in the Unicode
> Standard for more information."
>
> I couldn't find any more detail about these in the 6.1 Unicode standard.
>  There is more discussion of upper- and lowercasing, but no detail on
> casefolding.
>
> It is my sense that the "additional processing" that is mentioned would be
> the same as that for lowercasing that is specified in SpecialCasing.txt.
>  But since I'm rather an ignoramus on these matters, I'm not sure, and would
> like some guidance.
>
Received on Sun May 13 2012 - 15:59:18 CDT
