L2/02-046

To: UTC
Re: CaseFolding and SpecialCasing Tweaks
From: Mark Davis
Date: 2001-01-28

The UTC directed that changes be made in SpecialCasing and CaseFolding to preserve canonical equivalence. This is the principle that:

Given a function f mapping from strings to strings,
if a string X is canonically equivalent to a string Y, then f(X) is canonically equivalent to f(Y).
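
As an illustrative sketch (not the validity program mentioned below), the principle can be phrased as a check in Python. The helper names canonically_equivalent and preserves_canonical_equivalence are hypothetical; the only library call used is unicodedata.normalize, and two strings are treated as canonically equivalent when their NFD forms are identical.

```python
import unicodedata


def canonically_equivalent(x: str, y: str) -> bool:
    """True if x and y have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


def preserves_canonical_equivalence(f, x: str, y: str) -> bool:
    """Check the principle for one pair: X ~ Y must imply f(X) ~ f(Y)."""
    if not canonically_equivalent(x, y):
        raise ValueError("x and y must be canonically equivalent to begin with")
    return canonically_equivalent(f(x), f(y))


# Precomposed e-acute versus e followed by COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"

# Uppercasing keeps the pair equivalent; a function that depends on the
# number of code points (hypothetical counterexample) does not.
upper_ok = preserves_canonical_equivalence(str.upper, composed, decomposed)
length_ok = preserves_canonical_equivalence(lambda s: str(len(s)), composed, decomposed)
```

A single pair only shows that the principle holds or fails for that pair; a full check would have to run over every character with a canonical decomposition.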

This applies to the case operations toUppercase, toLowercase, toTitlecase, and toCasefold. At first it appeared that this would require substantial changes, so the editorial committee held off from making them. However, after retooling the validity program for a finer-grained test, it became clear that only a small change is required, and it involves only the beloved Turkish I.

Note: The goal is only to preserve canonical equivalence for the full case operations, not for the simple ones. The latter can never preserve canonical equivalence, as discussed in TR #21.

Background. We currently have:

http://www.unicode.org/Public/BETA/Unicode3.2/CaseFolding-5d2.txt

# A. To do a simple case folding, use the mappings with status C + S + I.
# B. To do a full case folding, use the mappings with status C + F + I.
# The mappings with status I can be omitted depending on the desired case-folding
# behavior. (The default option is to retain them.)
...
0049; C; 0069; # LATIN CAPITAL LETTER I
0130; I; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0131; I; 0069; # LATIN SMALL LETTER DOTLESS I

http://www.unicode.org/Public/BETA/Unicode3.2/SpecialCasing-3.2.0d5.txt

0307; ; 0307; 0307; After_Soft_Dotted; # COMBINING DOT ABOVE

http://www.unicode.org/Public/BETA/Unicode3.2/UnicodeData-3.2.0d8.txt

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;...;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049


Changes. To make CaseFolding obey canonical equivalence, 0130 and its canonical decomposition 0049 0307 must fold to the same value. Since CaseFolding is not context sensitive, 0049 0307 always folds to 0069 0307, so that has to be the folding of 0130 as well. Thus we need to change the 0130 line above to:

0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
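
The effect of this one-line change can be sketched in Python with folding tables restricted to the three CaseFolding entries quoted in the Background. The names OLD_FOLD, NEW_FOLD, fold, and nfd are hypothetical, and the tables are an illustration rather than the full data file.

```python
import unicodedata

# Only the entries quoted in this memo (status C + F + I, with I retained).
OLD_FOLD = {"\u0049": "\u0069", "\u0130": "\u0069", "\u0131": "\u0069"}
# Same table with the proposed change to the 0130 line.
NEW_FOLD = {"\u0049": "\u0069", "\u0130": "\u0069\u0307", "\u0131": "\u0069"}


def fold(s: str, table: dict) -> str:
    """Context-free folding: map each code point independently."""
    return "".join(table.get(ch, ch) for ch in s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


precomposed = "\u0130"        # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = "\u0049\u0307"   # its canonical decomposition

# Old table: 0130 folds to 0069, but 0049 0307 folds to 0069 0307.
old_preserves = nfd(fold(precomposed, OLD_FOLD)) == nfd(fold(decomposed, OLD_FOLD))
# New table: both fold to 0069 0307.
new_preserves = nfd(fold(precomposed, NEW_FOLD)) == nfd(fold(decomposed, NEW_FOLD))
```

With the old table the two foldings differ by a COMBINING DOT ABOVE; with the new table they are identical.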

This change requires corresponding changes in SpecialCasing, by

1) Changing the default mapping to preserve canonical equivalence

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

2) Then adding the previous mapping from UnicodeData, but conditional on Turkic, so it still works in that environment.

0130; 0069; 0130; 0130; TR; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; AZ; # LATIN CAPITAL LETTER I WITH DOT ABOVE

3) And finally, changing the context-dependent dot_above rule, to preserve canonical equivalence, from:

0307; ; 0307; 0307; After_Soft_Dotted; # COMBINING DOT ABOVE

to only apply to Turkic:

0307; ; 0307; 0307; After_Soft_Dotted TR; # COMBINING DOT ABOVE
0307; ; 0307; 0307; After_Soft_Dotted AZ; # COMBINING DOT ABOVE

Since the dotted uppercase I is only used in Turkic, this is a neutral change overall.
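
The 0130 entries in steps 1 and 2 can be sketched as a lookup in Python. This is a minimal illustration under stated assumptions: the condition field is transcribed with a closing ';', the names LINES, parse, RULES, and lower_char are hypothetical, and the After_Soft_Dotted context of step 3 is not modeled.

```python
import unicodedata

# The three 0130 lines from steps 1 and 2 above.
LINES = """\
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; TR; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; AZ; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def to_str(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse(lines: str):
    """Yield (code, lower, conditions) for each SpecialCasing-style line."""
    rules = []
    for line in lines.splitlines():
        data = line.split("#", 1)[0]              # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 4 or not fields[0]:
            continue
        conditions = fields[4].split() if len(fields) > 4 else []
        rules.append((to_str(fields[0]), to_str(fields[1]), conditions))
    return rules


RULES = parse(LINES)


def lower_char(ch: str, lang=None) -> str:
    """Full lowercase of one character; a language-specific rule wins."""
    default = ch
    for code, lower, conditions in RULES:
        if code != ch:
            continue
        if conditions and lang is not None and lang.upper() in conditions:
            return lower
        if not conditions:
            default = lower
    return default


default_lower = lower_char("\u0130")        # 0069 0307
turkish_lower = lower_char("\u0130", "tr")  # 0069
```

In the default case the result 0069 0307 is what a character-by-character lowercasing of 0049 0307 produces, so canonical equivalence is preserved, while the Turkic entries keep the previous single-character mapping.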