Subject: CaseFolding

From: Mark Davis [mark@macchiato.com]
Sent: Friday, November 03, 2000 11:44 AM


A. There is a proposed new version of CaseFolding on the alpha 3.1 site. The distinguishing features are:


1. You can now choose to use the file to case fold with either simple mappings (string lengths don't change) or full mappings (string lengths may change).


2. You can choose how Turkish is handled: either case folding is not fully case-insensitive for Turkish, or it distinguishes dotted i from dotless i.


See http://www.unicode.org/Public/3.1-Update/CaseFolding-3.d2.alpha.txt
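
To make the two choices concrete, here is a rough sketch in Python. The table below is a hand-written stand-in, not the layout of the proposed file: the kind labels ("common", "full", "turkic"), the function names, and the reading of the Turkish option as "swap in dotted/dotless i mappings" are all assumptions made for illustration.

```python
# Hypothetical folding data: (code point, kind, mapping).
# The kinds and the tuple layout are assumptions, not the file's real format.
FOLDINGS = [
    (0x0041, "common", "\u0061"),        # A -> a
    (0x0049, "common", "\u0069"),        # I -> i
    (0x00DF, "full",   "\u0073\u0073"),  # sharp s -> ss (length changes)
    (0x0130, "full",   "\u0069\u0307"),  # I with dot above -> i + combining dot
    (0x0049, "turkic", "\u0131"),        # I -> dotless i
    (0x0130, "turkic", "\u0069"),        # I with dot above -> i
]


def build_table(full=False, turkic=False):
    """Select the mappings that apply for the chosen options."""
    table = {}
    for cp, kind, mapping in FOLDINGS:
        if kind == "common" or (kind == "full" and full):
            table[cp] = mapping
    if turkic:
        # Turkic entries replace whatever was selected above.
        for cp, kind, mapping in FOLDINGS:
            if kind == "turkic":
                table[cp] = mapping
    return table


def fold(text, full=False, turkic=False):
    """Fold text character by character; unmapped characters are kept."""
    table = build_table(full, turkic)
    return "".join(table.get(ord(ch), ch) for ch in text)
```

With simple mappings the folded string always has the same length as the input; with full=True, a character such as sharp s expands to two characters. With turkic=True, I and dotless i fold together, and dotted capital I and i fold together, so the two pairs stay distinct.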



B. If this is adopted, then some changes will need to be made to TR 21 (http://www.unicode.org/unicode/reports/tr21/) to explain the new file. I suggest these be remanded to the editorial committee. In addition:


a. the following typo has been called to my attention: "the first conditions for upper and uniqueUpper are reversed"


b. we can make the generation of representative characters from the equivalence classes unique by adding the following:


"The representative character from the equivalence class is chosen such that UCD_lower(UCD_upper(x)) == x. If there is no character meeting this condition, then the character with the lowest code point is chosen. Otherwise, if there is more than one such character meeting this condition, then of the characters meeting this condition, the one with the lowest code point is chosen."


With the current data file, the last two sentences are unnecessary, but they allow for future anomalies.
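
The selection rule quoted above can be sketched as follows. The two small dictionaries stand in for UCD_lower and UCD_upper and cover only the sigma characters; they, and the function names, are assumptions for illustration rather than anything defined in TR 21.

```python
# Stand-ins for UCD_upper / UCD_lower, limited to the Greek sigma characters.
# Characters without an entry map to themselves.
UCD_UPPER = {0x03C3: 0x03A3,   # sigma       -> SIGMA
             0x03C2: 0x03A3}   # final_sigma -> SIGMA
UCD_LOWER = {0x03A3: 0x03C3}   # SIGMA       -> sigma


def ucd_upper(cp):
    return UCD_UPPER.get(cp, cp)


def ucd_lower(cp):
    return UCD_LOWER.get(cp, cp)


def representative(eq_class):
    """Pick the representative code point of an equivalence class.

    Prefer code points x with ucd_lower(ucd_upper(x)) == x; among those
    take the lowest. If none qualifies, take the lowest in the class.
    """
    stable = [cp for cp in eq_class if ucd_lower(ucd_upper(cp)) == cp]
    return min(stable) if stable else min(eq_class)
```

For the class {sigma, SIGMA, final_sigma}, only sigma (U+03C3) survives the round trip, so it is chosen. The fallback branch is exercised only by a class in which no member survives it.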


C. The addition in (b) would cause us to change the representative character for {sigma, SIGMA, final_sigma} in the data file to be sigma instead of final_sigma. Note: a slightly more sophisticated algorithm could use the FINAL condition to fold to sigma when the following character is a letter (general category L*), and to final_sigma otherwise. As with Turkish, we could make this an option in the file.
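
A minimal sketch of that context-sensitive variant, assuming "L*" means any general category beginning with L as reported by Python's unicodedata module. The function name is hypothetical, and only the three sigma characters are folded here; everything else passes through unchanged.

```python
import unicodedata

# SIGMA, sigma, final_sigma
SIGMA_CLASS = {"\u03A3", "\u03C3", "\u03C2"}


def fold_sigma(text):
    """Fold any sigma to sigma before a letter, and to final_sigma otherwise."""
    out = []
    for i, ch in enumerate(text):
        if ch in SIGMA_CLASS:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            before_letter = nxt != "" and unicodedata.category(nxt).startswith("L")
            out.append("\u03C3" if before_letter else "\u03C2")
        else:
            out.append(ch)
    return "".join(out)
```

Note that under this rule an isolated sigma, with nothing after it, folds to final_sigma.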


First comments:


Michael Kaplan:

I would also suggest an additional folding be considered for the TR: folding the characters in Extension B that are literally considered "mistakes". Having them able to fold into the characters that they should be is an interesting and potentially very useful operation to consider, especially since there really is no other collation that makes sense for Extension B.


Walt Daniels:

I see creeping featurism as a serious disease. Adding options to handle things the way that is best for some particular group is usually a disservice to all other groups dealing with the extra complexity. Admittedly a balancing act, but one that should weight low complexity heavily.