Subject: CaseFolding
From: Mark Davis [mark@macchiato.com]
Sent: Friday, November 03, 2000 11:44 AM
A. There is a proposed new version of CaseFolding on the
alpha 3.1 site. The distinguishing features are:
1. You can now choose to use the file to case fold with either simple mappings (string lengths don't change) or full mappings (string lengths may change).
2. You can choose whether to have Turkish be not fully case-insensitive,
or have it distinguish dotted i from undotted i.
See http://www.unicode.org/Public/3.1-Update/CaseFolding-3.d2.alpha.txt
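The two choices in (A) can be sketched as follows. This is a minimal illustration only: the FOLDINGS table is a hand-written, hypothetical excerpt, and its status letters (C = common, S = simple, F = full, T = Turkic) follow the later published CaseFolding.txt, which may differ from the alpha draft. The names build_table and case_fold are invented for the example.

```python
# Hypothetical excerpt of case-folding data: (code point, status, mapping).
# Status letters are an assumption borrowed from later published
# versions of CaseFolding.txt, not necessarily the alpha draft.
FOLDINGS = [
    (0x0041, "C", "\u0061"),        # A -> a
    (0x0049, "C", "\u0069"),        # I -> i
    (0x0049, "T", "\u0131"),        # I -> dotless i (Turkic option)
    (0x00DF, "F", "ss"),            # sharp s -> ss (length changes)
    (0x0130, "F", "\u0069\u0307"),  # dotted capital I -> i + combining dot
    (0x0130, "T", "\u0069"),        # dotted capital I -> i (Turkic option)
]


def build_table(full, turkic):
    """Select the mappings that apply for the chosen options."""
    wanted = {"C", "F" if full else "S"}
    table = {}
    for cp, status, mapping in FOLDINGS:
        if status in wanted:
            table[chr(cp)] = mapping
    if turkic:
        # Turkic mappings override the default ones for the same character.
        for cp, status, mapping in FOLDINGS:
            if status == "T":
                table[chr(cp)] = mapping
    return table


def case_fold(text, full=True, turkic=False):
    """Fold each character; unmapped characters are left unchanged."""
    table = build_table(full, turkic)
    return "".join(table.get(ch, ch) for ch in text)
```

With this excerpt, case_fold("A\u00DF") gives "ass" (the string grows), while case_fold("A\u00DF", full=False) gives "a\u00DF" (same length). With turkic=True, "I" folds to dotless i instead of "i".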
B. If this is adopted, then some changes will need to be
made to TR 21 (http://www.unicode.org/unicode/reports/tr21/)
to explain the new file. I suggest these be remanded to the editorial
committee. In addition:
a. the following typo has been called to my attention:
"the first conditions for upper and uniqueUpper are reversed"
b. we can make the generation of representative characters
from the equivalence classes unique by adding the following.
"The representative character from the equivalence
class is chosen such that UCD_lower(UCD_upper(x)) == x. If there is no
character meeting this condition, then the character with the lowest code point
is chosen. Otherwise, if there is more than one such character meeting this
condition, then of the characters meeting this condition, the one with the
lowest code point is chosen."
With the current data file, the last two sentences of this rule are
unnecessary, but they allow for future anomalies.
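The selection rule quoted in (b) can be sketched as below. This is only an illustration under stated assumptions: the function name representative and the two small mapping dictionaries are hypothetical stand-ins for UCD_upper and UCD_lower, holding just the simple mappings needed for the sigma class.

```python
# Hypothetical stand-ins for UCD_upper / UCD_lower (simple mappings only).
SIMPLE_UPPER = {"\u03C3": "\u03A3", "\u03C2": "\u03A3"}  # sigma, final sigma -> SIGMA
SIMPLE_LOWER = {"\u03A3": "\u03C3"}                      # SIGMA -> sigma


def representative(equiv_class, upper=SIMPLE_UPPER, lower=SIMPLE_LOWER):
    """Pick the representative character of a case equivalence class.

    Prefer characters x with lower(upper(x)) == x; among those (or among
    all members, if none qualifies) take the lowest code point.
    """
    def round_trips(ch):
        up = upper.get(ch, ch)          # unmapped characters map to themselves
        return lower.get(up, up) == ch

    candidates = [ch for ch in equiv_class if round_trips(ch)]
    if not candidates:
        candidates = list(equiv_class)
    # min() on single-character strings compares by code point.
    return min(candidates)
```

For the class {sigma, SIGMA, final_sigma}, only sigma satisfies the round-trip condition, so representative({"\u03C3", "\u03A3", "\u03C2"}) returns "\u03C3".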
C. The addition in (b) would cause us to change the
representative character for {sigma, SIGMA, final_sigma} in the data file to be
sigma instead of final_sigma. Note: a slightly more sophisticated algorithm
could use the FINAL condition to do a case folding that used sigma if the
following character is an L*, and final_sigma otherwise. Like Turkish, we could
make this an option in the file.
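The FINAL-condition idea in (C) might look roughly like this. It is a sketch, not the proposed algorithm itself: the name fold_sigmas is invented, only the three sigma characters are folded, and "L*" is read as any Unicode general category beginning with L, tested with Python's unicodedata.category.

```python
import unicodedata

SIGMA = "\u03C3"        # small sigma
FINAL_SIGMA = "\u03C2"  # small final sigma
SIGMAS = {"\u03A3", SIGMA, FINAL_SIGMA}


def fold_sigmas(text):
    """Fold every sigma form to sigma when a letter follows, else to final sigma.

    Characters other than the three sigma forms are copied unchanged;
    a complete case folding would map them as well.
    """
    out = []
    for i, ch in enumerate(text):
        if ch in SIGMAS:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            followed_by_letter = (
                nxt != "" and unicodedata.category(nxt).startswith("L")
            )
            out.append(SIGMA if followed_by_letter else FINAL_SIGMA)
        else:
            out.append(ch)
    return "".join(out)
```

For example, fold_sigmas("\u03A3\u0391\u03A3") yields sigma, ALPHA, final_sigma: the first SIGMA is followed by a letter, the last is at the end of the string.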
First comments:
Michael Kaplan:
I would also suggest that an additional folding be considered for the TR: folding the characters in Extension B that are literally considered "mistakes". Being able to fold them into the characters that they should be is an interesting and potentially very useful operation to consider, especially since there really is no other collation that makes sense for Extension B.
Walt Daniels:
I see creeping featurism as a serious disease.
Adding options to handle things in the way that is best for some particular
group is usually a disservice to all other groups, who must deal with the extra
complexity. Admittedly a balancing act, but one that should weight low
complexity heavily.