L2/11-163

Date: Sat Apr 30 13:28:54 CDT 2011
Name: Tom Christiansen (tchrist@perl.com)
Subject: PRI179 feedback regarding case-insensitivity

I would like to voice my dissent over the proposal to withdraw the recommendation for doing full case-insensitive matching in RL2.4 Default Loose Matches. I believe that this recommendation first became official in tr18-6, released on 2002-04-21. Because of that recommendation, the 5.8.0 release of Perl on 2002-06-01 implemented full case folding for case-insensitive matches. In the nearly nine years since then, users have come to expect this behavior, and it would be a severe hardship to withdraw it from them now. If the Unicode Standard makes no provision for permitting recommended behavior that is almost nine years old now, you will put us in a hard place.

It is true that we have had bugs in the handling of full case mappings, but we have worked hard to eliminate those. One particular place where these bugs had been a problem for us was in square-bracketed character classes, although these are now fixed. I will discuss those momentarily, but first I wish to draw attention to the places where it is very important that full case folding on case-insensitive matches be supported.

I believe that your example using "ß" U+00DF is insufficient to motivate people to implement full case folding. This is both because the Latin script has comparatively few code points with a full 1:Many case mapping, and also because most of those are there so that round-trip conversion to and from legacy repertoires will preserve ligatures like FF and FFI. There are only 16 code points in the Latin script with 1:Many case mappings, and there are 6 such code points in the Armenian script. In contrast, the Greek script has 81 such code points, and it is therefore in Greek that the issue arises most frequently.

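To make "1:Many case mapping" concrete, here is a minimal sketch in Python rather than Perl, on the assumption that Python's built-in str.casefold() applies Unicode full case folding. The four sample characters are my own picks for illustration (one each for sharp s, a Latin ligature, an Armenian ligature, and the Greek letter used below); the helper name is hypothetical.

```python
import unicodedata

# Illustrative sample characters (chosen for this sketch, not taken from the letter's counts).
SAMPLES = ["\u00DF", "\uFB03", "\u0587", "\u1FB2"]


def fold_codepoints(ch):
    """Return the full case folding of ch as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(c) for c in ch.casefold()]


for ch in SAMPLES:
    # Each of these single code points folds to two or three code points.
    print("U+%04X %s -> %s" % (ord(ch), unicodedata.name(ch), " ".join(fold_codepoints(ch))))
```

Each line of output shows one code point on the left and more than one on the right, which is exactly what a simple (1:1) fold cannot express.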
Please consider these strings:

    lowercase: "ᾲ στο διάολο"
    titlecase: "Ὰͅ Στο Διάολο"
    uppercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ"

The same strings written with code point escapes:

    lowercase: "\x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}"
    titlecase: "\x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}"
    uppercase: "\x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}"

And with character names:

    lowercase: "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}"

    titlecase: "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}"

    uppercase: "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK CAPITAL LETTER TAU}\N{GREEK CAPITAL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER ALPHA WITH TONOS}\N{GREEK CAPITAL LETTER OMICRON}\N{GREEK CAPITAL LETTER LAMDA}\N{GREEK CAPITAL LETTER OMICRON}"

A user making a case-insensitive match of /^ᾲ/i, which I will here indicate with a trailing /i to mean an embedded (?i), will certainly expect all three of those versions to be matched. This remains true no matter the case of the string or of the pattern.

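The expectation can be sketched outside a regex engine by comparing full case foldings directly. This is an illustration in Python, not how Perl implements /i; the helper starts_with_caseless is hypothetical, and the sketch deliberately ignores normalization, which a complete caseless match would also need to consider.

```python
# The three sample strings, written with escapes so the code points are unambiguous.
LOWER = "\u1FB2 \u03C3\u03C4\u03BF \u03B4\u03B9\u03AC\u03BF\u03BB\u03BF"
TITLE = "\u1FBA\u0345 \u03A3\u03C4\u03BF \u0394\u03B9\u03AC\u03BF\u03BB\u03BF"
UPPER = "\u1FBA\u0399 \u03A3\u03A4\u039F \u0394\u0399\u0386\u039F\u039B\u039F"


def starts_with_caseless(text, prefix):
    """Hypothetical helper: prefix test on the full case foldings of both sides."""
    return text.casefold().startswith(prefix.casefold())


TEXTS = [LOWER, TITLE, UPPER]
PREFIXES = ["\u1FB2", "\u1FBA\u0345", "\u1FBA\u0399"]  # the three casings of the first letter

# Every string against every casing of the prefix: 3 x 3 = 9 comparisons.
RESULTS = [starts_with_caseless(t, p) for t in TEXTS for p in PREFIXES]
print(all(RESULTS), len(RESULTS))
```

All three strings fold to the same sequence of code points, so every combination of string casing and prefix casing agrees.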
The nine combinations of string and pattern are:

    lowercase w/lowercase: "ᾲ στο διάολο" =~ /^ᾲ/i
    lowercase w/titlecase: "ᾲ στο διάολο" =~ /^Ὰͅ/i
    lowercase w/uppercase: "ᾲ στο διάολο" =~ /^ᾺΙ/i

    titlecase w/lowercase: "Ὰͅ Στο Διάολο" =~ /^ᾲ/i
    titlecase w/titlecase: "Ὰͅ Στο Διάολο" =~ /^Ὰͅ/i
    titlecase w/uppercase: "Ὰͅ Στο Διάολο" =~ /^ᾺΙ/i

    uppercase w/lowercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" =~ /^ᾲ/i
    uppercase w/titlecase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" =~ /^Ὰͅ/i
    uppercase w/uppercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" =~ /^ᾺΙ/i

And indeed, in Perl all 9 of those match. Furthermore, the charclass negations also all correctly fail to match (where !~ is the negation of =~):

    lowercase: "ᾲ στο διάολο" !~ /^[^ᾲ]/i
    titlecase: "Ὰͅ Στο Διάολο" !~ /^[^ᾲ]/i
    uppercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" !~ /^[^ᾲ]/i

Those are all true because each of those strings begins with a case mapping of "ᾲ", so the assertion that the string does not start with something other than that code point is true.

Based upon conversations I have had with people who actually handle Greek text, I believe that this functionality is just as important to them as matching both "Apple" and "apple" with /^a/i is to us. It seems culturally unfair to deny users of the Greek script the same convenience in matching that users of the Latin script enjoy.

That said, I would like to draw your attention to two different problems that arise due to full case mappings, or multichar folds as they are sometimes called. Both relate to user expectations that a square-bracketed character class specifying single code points will always *match* a single code point. Under full case mapping, it may not. And this can cause problems.

The first problem is that, unlike lookaheads, lookbehinds in regexes are often implemented so that they must specify a fixed-length string. Therefore, while (?<=[abc]) is permitted, (?<=[abc]+) is not. When matching case insensitively, character classes that seem to specify only single code points *can* become variable in length, and are thus forbidden from lookbehinds under most, albeit not all, implementations.

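The fixed-length restriction is not peculiar to Perl. As a hedged illustration, Python's standard re module also rejects variable-width lookbehinds. Note the assumption stated in the comments: Python's re does not apply multichar folds under IGNORECASE, so the case-insensitive sharp-s lookbehind still compiles there, unlike in an engine that uses full case folding. The helper compiles is hypothetical.

```python
import re


def compiles(pattern, flags=0):
    """Hypothetical helper: report whether the re module accepts the pattern."""
    try:
        re.compile(pattern, flags)
        return True
    except re.error:
        return False


print(compiles(r"(?<=[abc])"))    # fixed width: one code point
print(compiles(r"(?<=[abc]+)"))   # variable width: rejected by re

# Python's re has no multichar folds, so this stays one code point wide and compiles.
print(compiles("(?<=\xDF)", re.IGNORECASE))

# Under full case folding the same character is two code points wide,
# which is what makes the lookbehind variable in length.
print(len("\xDF"), len("\xDF".casefold()))
```

An engine that does apply full case folding has to treat the class member as matching either one or two code points, and that is where the fixed-length assumption breaks.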
The lookbehind problem is not limited to bracketed character classes, although that is usually where it shows up, because a user has unknowingly included something in the charclass that has a multichar fold. To demo without charclasses, this:

    % perl -cwe '/(?<=\xDF)/'

compiles fine, but making it case-insensitive causes a compilation error:

    % perl -cwe '/(?<=\xDF)/i'
    Variable length lookbehind not implemented in regex m/(?<=\xDF)/ at -e line 1.

The second problem with multichar folds in case-insensitive matches is that many patterns which were written with an 8-bit mindset get transferred to Unicode unmodified. So a pattern like [^\x80-\xFF], which was equivalent to [\x00-\x7F], is now equivalent to [\x00-\x7F\x{100}-\x{10FFFF}].

In each pair of patterns below, the first is performed case sensitively and the second case insensitively. I believe that users will be confused by the results of the last two out of three case-insensitive matches.

       No:  "dress" =~ /[^\x00-\x7F]/
       No:  "dress" =~ /[^\x00-\x7F]/i

       No:  "dress" =~ /[\x80-\xFF]/
    !! Yes: "dress" =~ /[\x80-\xFF]/i

       Yes: "dress" =~ /^[^\x80-\xFF]+$/
    !! No:  "dress" =~ /^[^\x80-\xFF]+$/i

The reason that is happening is because of this:

       No:  "dress" =~ /\xDF/
       Yes: "dress" =~ /\xDF/i

Although it might be argued that one should not allow multichar folds to happen in character classes, my earlier Greek example shows that they *must*. I do not believe you can appease both groups at once. It is possible that a regex flag regarding simple-vs-full case mapping might help, but then you have to decide on reasonable defaults. Perhaps a reasonable default is simple only, so that users needing multichar folds can specify that. But they may not know to do so.

It is a difficult task to educate users about the pitfalls of ASCII-minded patterns applied to Unicode. They do not understand the two "!!" matches given above. Even when they are brought to understand these, they invariably consider them "wrong".

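The two ways of looking at "dress" can be put side by side. The sketch below is in Python and only models the two viewpoints; it does not reproduce Perl's matcher. It assumes str.casefold() performs full case folding and that Python's re module, lacking multichar folds, stands in for a "simple folding only" engine.

```python
import re

TEXT = "dress"

# Set view: is any code point of the text inside the range \x80-\xFF?
in_high_range = any(0x80 <= ord(c) <= 0xFF for c in TEXT)

# Full case folding view: U+00DF folds to "ss", and "ss" does occur in "dress".
full_fold_hit = "\xDF".casefold() in TEXT.casefold()

# Simple folding view: an engine without multichar folds finds nothing.
simple_fold_hit = re.search("\xDF", TEXT, re.IGNORECASE) is not None

print(in_high_range, full_fold_hit, simple_fold_hit)
```

The first and third answers agree with each other and disagree with the second, which mirrors the choice between simple and full folding as a default.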
Users consider these results wrong because they are thinking in terms of sets and set-complements when they see bracketed character classes. To them, since there is clearly no \xDF in "dress", it is unreasonable that a pattern which says the string has no \xDF, as /^[^\x80-\xFF]+$/i does, should turn around and claim that it found something that isn't there.

The question also arises of what you do with backreferences. Can /(ᾲ)/i match two (or more) code points for group 1, even though only one was specified? Apparently it must.

Finally, I would like to suggest that case-insensitive matching as it is currently defined is considerably less useful in practice than it should be. The purpose of case-insensitive matching is to allow a shorthand form that spares the user from having to enumerate all possible variations of the same letter. However, apart from a *VERY* few rules such as for ANGSTROM SIGN, MICRO SIGN, and KELVIN SIGN, and the familiar but painful LATIN SMALL LETTER SHARP S, you really cannot do that.

For example, even though these are all considered the same letter at the primary collation strength, they do not match one another case insensitively:

    d  U+0064  LATIN SMALL LETTER D
    ð  U+00F0  LATIN SMALL LETTER ETH
    ꝺ  U+A77A  LATIN SMALL LETTER INSULAR D
    ｄ  U+FF44  FULLWIDTH LATIN SMALL LETTER D

Given that *they are all the same letter according to the UCA*, I submit that they *should* (be able to) match each other, case-insensitively. Furthermore, while the last of those has a K decomposition to the first of them, the middle two do not.

The same thing happens with an "s":

    s  U+0073  LATIN SMALL LETTER S
    ſ  U+017F  LATIN SMALL LETTER LONG S
    ꞅ  U+A785  LATIN SMALL LETTER INSULAR S
    ｓ  U+FF53  FULLWIDTH LATIN SMALL LETTER S

The third of those does not count as an "s" when matched case insensitively, but the second does. Again, there is no decomposition that will get you to something that tests as "s". This is a problem with many letters.

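The gap can be checked mechanically. The following Python sketch tests, for each of the eight code points listed above, whether full case folding alone or a compatibility (NFKD) decomposition reaches the plain letter. It says nothing about UCA primary weights, since the standard library has no collation support; the helper names are hypothetical and the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata

D_LIKE = ["\u0064", "\u00F0", "\uA77A", "\uFF44"]  # d, eth, insular d, fullwidth d
S_LIKE = ["\u0073", "\u017F", "\uA785", "\uFF53"]  # s, long s, insular s, fullwidth s


def folds_to(ch, target):
    """True if full case folding alone turns ch into target."""
    return ch.casefold() == target


def nfkd_reaches(ch, target):
    """True if a compatibility decomposition turns ch into target."""
    return unicodedata.normalize("NFKD", ch) == target


for group, target in ((D_LIKE, "d"), (S_LIKE, "s")):
    for ch in group:
        print("U+%04X fold=%s nfkd=%s" % (ord(ch), folds_to(ch, target), nfkd_reaches(ch, target)))
```

Eth, insular d, and insular s come back False on both tests, so neither folding nor decomposition gives a matcher any route to the base letter for them.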
Imagine that you want to match LATIN SMALL LETTER O no matter what sort of combining marks follow it. Old code in ISO 8859-1 may have used [óòôöõø] for that, but with all the precomposed characters like ō, not to mention the possibility of arbitrary combining marks, that won't work. So you would think that it would be enough to write

    NFD($string) =~ /(?=o)\pM*/

but it is not. That's because not all code points that are considered the same letter as "o" according to the UCA have any available decomposition that actually starts with "o"! You have the same problem with many other letters. I can easily produce a comprehensive list of these.

I am aware of

    RL3.4 Tailored Loose Matches
        To meet this requirement, an implementation shall provide for loose matches based on a locale's collation order, with at least 3 levels.

and that would *appear* to resolve the issue. However, I don't believe it does. First, one should not have to go to the highest possible level of Unicode regex support merely to achieve this basic functionality that is in practice so very needed by so many (and it is!). Furthermore, RL3.4 mentions only locales. One should be able to apply the default UCA without dragging messy locales into it.

I therefore propose that UCA matching without locales be made a Level 2 requirement, and that Level 3 be reserved for locales, since it necessarily requires tailoring support and plain UCA support should not require that.

Note that UCA matching, at least at the primary strength, solves your vexing problem of canonical equivalence. This is another reason that UCA primary strength comparison should be moved into Level 2.

What users of case-insensitive matching truly want is to be able to compare whether things are the same letter IN THE UCA SENSE, without respect to casing or accent marks. They should be able to get at that easily, and under the current requirements, they cannot do so.

Tom Christiansen
tchrist@perl.com