L2/12-162

Date/Time: Tue May 1 19:06:08 CDT 2012
Name: Karl Williamson
Subject: PRI 182 Feedback

1) I support withdrawing the recommendation for full case-insensitive matching in RL2.4 Default Loose Matches. Perl 5, to which I contribute, currently implements this partially, but we were forced to remove some of that support because of counter-intuitive results that broke real-world applications. One example is the negation of a character class that contains LATIN SMALL LETTER SHARP S, which folds to the two-character sequence 'ss'. It is easy to naively write regular expressions that, very non-obviously, match strings containing 's' but not strings containing 'ss'. (See https://rt.perl.org/rt3/Public/Bug/Display.html?id=89750.) I also believe there are intractable problems with back references when multi-character folds are involved.
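
The sharp s difficulty can be illustrated with a small sketch in Python (chosen only for illustration; this is not Perl's implementation). The helper names are hypothetical, and the model is deliberately naive: a character class is taken to match a string when the two have equal full case folds, and negation simply inverts that result.

```python
# Toy model of full case-insensitive matching for a character class.
# Helper names are hypothetical; this is not how Perl implements classes.

SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S


def class_matches(members, text):
    """True if text equals some class member under full case folding."""
    return any(text.casefold() == member.casefold() for member in members)


def negated_class_matches(members, text):
    """Naive negation: accept anything the positive class rejects."""
    return not class_matches(members, text)


# Full case folding turns one code point into two.
print(len(SHARP_S), len(SHARP_S.casefold()))

# The negated class accepts "s" but rejects "ss".
print(negated_class_matches([SHARP_S], "s"))
print(negated_class_matches([SHARP_S], "ss"))
```

Under this model a pattern author who writes the negated class to exclude one character ends up also excluding a two-character string that does not appear anywhere in the pattern.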

2) I support the changes to Canonical equivalent matching. Perl has never complied with the current recommendation, because I couldn't figure out a way to even begin to do it; and I presume my predecessors couldn't either.

3) I am very strongly opposed to having most properties match differing sets of code points under case sensitive versus case-insensitive matching. First, there are only a few properties and property-values currently that aren't already closed under case-insensitive matching. That means that mostly the case-insensitiveness doesn't matter. Here is the list for 6.0 where it does (unless there are bugs in the code I used to generate this):

\p{Age=3.2}
\p{Age=4.0}
\p{Age=4.1}
\p{Age=5.0}
\p{Age=5.1}
\p{Age=6.0}
\p{Bc=NSM}
\p{Blk=Basic_Latin}
\p{Blk=Combining_Diacritical_Marks}
\p{Blk=Georgian}
\p{Blk=Georgian_Supplement}
\p{Blk=Greek_And_Coptic}
\p{Blk=Greek_Extended}
\p{Blk=IPA_Extensions}
\p{Blk=Latin_1_Supplement}
\p{Blk=Latin_Extended_A}
\p{Blk=Latin_Extended_B}
\p{Blk=Latin_Extended_C}
\p{Blk=Latin_Extended_D}
\p{Blk=Letterlike_Symbols}
\p{Blk=Phonetic_Extensions}
\p{Ccc=240}
\p{CI=Y}
\p{Comp_Ex=Y}
\p{CWCF=Y}
\p{CWL=Y}
\p{CWT=Y}
\p{CWU=Y}
\p{Dia=Y}
\p{DI=Y}
\p{Dt=Com}
\p{Ea=Na}
\p{GCB=EX}
\p{Gc=Ll}
\p{Gc=Lt}
\p{Gc=Lu}
\p{Gc=M}
\p{Gc=Mn}
\p{Gr_Ext=Y}
\p{Jt=T}
\p{Lb=AI}
\p{Lb=CM}
\p{Lower=Y}
\p{NFC_QC=M}
\p{NFC_QC=N}
\p{NFKC_QC=M}
\p{SB=EX}
\p{SB=LO}
\p{SB=UP}
\p{Sc=Grek}
\p{Sc=Zinh}
\p{SD=Y}
\p{Upper=Y}
\p{WB=Extend}

(The above list omits the obvious complements, and was generated using full case folding rules.)
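
For readers who want to spot-check entries in the list, here is a minimal sketch in Python of one way to test whether a set of code points is closed under case folding. It is not the code I used to generate the list. The function names are hypothetical, it uses Python's str.casefold (full case folding), and it relies on whatever Unicode version the Python runtime ships rather than 6.0, so results for other properties may differ from the list above.

```python
import sys
import unicodedata

ALL_CODE_POINTS = range(sys.maxunicode + 1)


def gc(value):
    """Code points whose General_Category starts with value (e.g. 'Lu', 'M')."""
    return {cp for cp in ALL_CODE_POINTS
            if unicodedata.category(chr(cp)).startswith(value)}


def is_closed_under_folding(members):
    """True if no code point outside members shares a case fold with a member."""
    folds = {chr(cp).casefold() for cp in members}
    return all(cp in members
               for cp in ALL_CODE_POINTS
               if chr(cp).casefold() in folds)


for value in ("Lu", "Mn", "Nd"):
    print(value, is_closed_under_folding(gc(value)))
```

Gc=Mn fails the check because U+0345 COMBINING GREEK YPOGEGRAMMENI folds to a Greek letter, which is an example of a property where the case sensitivity is only incidental.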

For many of these, it really makes no sense to include things outside the real property. I can't imagine someone wanting a case-insensitive closure of any Age property-value, for example, nor WB=Extend. Nor for that matter, any Block. What is the use case for the example in the draft proposal, \p{Block=Phonetic_Extensions} matching things outside the original block? Why would anyone want to do that?

Most of the others match just a few more code points caselessly. For only a very few of these properties is the case-insensitiveness more than incidental.

Thus, most likely, programmers will construct regular expressions that match case-insensitively based on considerations other than the properties they contain, and will be surprised when these expressions match more than they expected. Having the match do this violates the Principle of Least Astonishment (see, e.g., http://en.wikipedia.org/wiki/Principle_of_least_astonishment). I think most people would be surprised to find that /(?i)\p{ASCII}/ matches 130 rather than 128 characters (under simple matching).
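
The count of 130 can be reproduced with a short Python sketch. The function name is hypothetical. Python's str.casefold applies full case folding, which is assumed here to agree with simple folding for this purpose, since no ASCII character has a multi-character fold.

```python
import sys

ASCII = set(range(128))


def caseless_closure(members):
    """All code points sharing a case fold with some member of the set."""
    folds = {chr(cp).casefold() for cp in members}
    return {cp for cp in range(sys.maxunicode + 1)
            if chr(cp).casefold() in folds}


closure = caseless_closure(ASCII)
extras = sorted(closure - ASCII)

print(len(closure))
print([f"U+{cp:04X}" for cp in extras])
```

The two additions are U+017F LATIN SMALL LETTER LONG S and U+212A KELVIN SIGN, neither of which a programmer writing \p{ASCII} is likely to have in mind.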

I believe that adopting the current proposal will lead to more bugs, to time wasted when regular expressions don't behave as their writers expect, and potentially to security holes. We in Perl 5 went through the same effort in trying to figure out what was going wrong with the application I mentioned in my comment on 1) above. To be more succinct, I believe that the adoption of this, as worded, will lead to counter-intuitive results which will be harmful, and may force this to be withdrawn in the future, just as full matching is now being withdrawn. The only legitimate reason I can think of for adopting it is mathematical consistency. I do believe that such consistency is an argument for doing something, but there are times when other considerations trump it.

There are, however, properties where the programmer does expect caseless matching to behave differently. People expect /(?i)\p{Uppercase=Y}/ to match lowercase and titlecase letters as well. When Perl 5 didn't do that, we got bug reports from the field about it. The solution we came to (after posting on the Unicode forum and getting Asmus' feedback) is to change the case-insensitive matching for just those properties that are all about case; these are the ones that programmers expect to match differently under (?i), and are the ones in the list above where there is a significant difference between cased and caseless matching. \p{Lowercase=Y}, \p{Uppercase=Y}, and \p{Cased=Y} all match the exact same set of code points case-insensitively (also for the =N sets); likewise \p{Gc=Lu}, \p{Gc=Ll}, \p{Gc=Lt}, and \p{Gc=LC} all match the exact same set of code points case-insensitively (and the same holds for their Posix subsets [:upper:], [:lower:], and [:alpha:]). No other properties change behavior when matching caselessly. This has been in the field for about a year now with zero complaints.

My guess is that there are two main reasons for the text in the current draft proposal:

1) perceived implementation simplicity
2) perceived cognitive simplicity: fewer rules to know

Both are valid reasons, but not when the result is harmful, which is what I've asserted above. Perl's implementation is not very much more complicated than just accumulating all code points and then applying case closure. We keep two lists, one of explicitly mentioned code points (including those in ranges), and one of code points from properties. As we parse a property name under caseless matching, if it is one of those few that are different, we just substitute the closure equivalent of it. For example, we change Gc=Ll into Gc=LC. The list of explicitly mentioned code points does have case closure applied to it at the end, and then the union is taken with the property list.
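
The two-list approach described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Perl's source: every name here is hypothetical, only General_Category values are handled, and case closure is approximated with Python's str.casefold.

```python
import sys
import unicodedata

ALL_CODE_POINTS = range(sys.maxunicode + 1)

# Hypothetical table: the few case-related values that widen under (?i).
CASELESS_SUBSTITUTE = {"Lu": "LC", "Ll": "LC", "Lt": "LC"}
GROUPED_VALUES = {"LC": {"Lu", "Ll", "Lt"}}


def gc_set(value):
    """Code points with the given General_Category value (LC is a group)."""
    categories = GROUPED_VALUES.get(value, {value})
    return {cp for cp in ALL_CODE_POINTS
            if unicodedata.category(chr(cp)) in categories}


def case_closure(members):
    """All code points sharing a case fold with some member."""
    folds = {chr(cp).casefold() for cp in members}
    return {cp for cp in ALL_CODE_POINTS if chr(cp).casefold() in folds}


def compile_class(explicit, gc_values, caseless):
    """Build a character class from explicit code points and Gc values."""
    from_properties = set()
    for value in gc_values:
        if caseless:
            # Substitute the closure equivalent for case-related values only.
            value = CASELESS_SUBSTITUTE.get(value, value)
        from_properties |= gc_set(value)

    explicit = set(explicit)
    if caseless:
        # Case closure applies only to explicitly mentioned code points.
        explicit = case_closure(explicit)
    return explicit | from_properties
```

In this sketch an explicit 'k' picks up 'K' and U+212A KELVIN SIGN under caseless matching, Gc=Ll widens to Gc=LC, and a property such as Gc=Mn is left exactly as it is.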

I also claim that any apparent gain in cognitive simplicity is illusory. People's cognitive map, I believe, doesn't include the idea that most properties might match differently caselessly.

Programmers also expect that ranges under (?i) will match case-insensitively, so for example /(?i)[A-Z]/ is the same as /[A-Za-z]/. When there were bugs in Perl 5 where this didn't work properly, we heard about it. For example, there was a bug report filed when we didn't case-fold the modern Cyrillic alphabet. Programmers expect that writing a range gives the logical range. They don't expect, I suspect, that the range consisting of the modern lowercase Greek letters would caselessly match the unassigned U+03A2 code point (unassigned because there are two lowercase sigmas, and only one uppercase). (We don't yet handle that one specially, but we do correctly handle the similar situation on EBCDIC machines, where [A-Z], if done just by code point order, would include more than 26 characters.) We do document that one has to be careful with ranges.
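
The U+03A2 case can be made concrete with a short Python sketch. It contrasts two hypothetical strategies for a caseless range: changing the case of only the endpoints and filling in between, versus folding each member of the range individually. Neither function is taken from Perl; the first is only an assumption about how a range might come to match U+03A2.

```python
import sys
import unicodedata

ALPHA, OMEGA = 0x03B1, 0x03C9  # GREEK SMALL LETTER ALPHA .. OMEGA


def endpoint_upper_range(lo, hi):
    """Naive approach: uppercase only the endpoints and fill in between."""
    return set(range(ord(chr(lo).upper()), ord(chr(hi).upper()) + 1))


def per_character_closure(lo, hi):
    """Fold every member of the range and collect all code points that agree."""
    folds = {chr(cp).casefold() for cp in range(lo, hi + 1)}
    return {cp for cp in range(sys.maxunicode + 1)
            if chr(cp).casefold() in folds}


naive = endpoint_upper_range(ALPHA, OMEGA)
closure = per_character_closure(ALPHA, OMEGA)

print(unicodedata.category("\u03a2"))        # general category of U+03A2
print(0x03A2 in naive, 0x03A2 in closure)    # membership under each strategy
print(0x03A3 in closure)                     # GREEK CAPITAL LETTER SIGMA
```

The endpoint approach sweeps in the unassigned code point because the uppercase block has a hole where a second capital sigma would be, while per-character folding maps both lowercase sigmas to the single capital.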

I'm not sure what you are asking for about matching the dot caselessly. In Perl 5, it matches any single code point, regardless of case-insensitivity. This seems to me to be the only sensible approach, even if full case-folded matching is done.

Review Notes:

1) See above

2) I do not have the expertise to comment on the DUCET proposal.

3) I lean very slightly to having the @ notation in a separate section.