L2/11-064

To: UTC

From: Mark Davis

Date: 2011-01-08

Re: Case-Insensitive Regex

 

There has been a request for clarification of how to do case-insensitive matching in regex, according to the Unicode guidelines (L2/11-030). At a high level, I think the request for clarification is a reasonable one, but it is unclear as yet exactly what we should recommend in our guidelines.

 

Case-insensitive matching is one example of matching that is preserved under transformation; canonically-equivalent matching is another. There are some fundamental issues that we need to investigate that are common to both.

 

Whatever we do, I think we need to have some fairly extensive public review of the issue, with a document that outlines all of the alternative approaches, with known pluses and minuses. We may need to take into account the ‘degree of difficulty’ involved in the implementation of these, and may have a basic recommendation for Level 1, and a more sophisticated recommendation for Level 2.

 

Here are some comments on the proposal. There are two main questions:

 

1. Should the transform be "black-box" or "white-box"?

 

With a black-box model, the following principle is true:

 

if R matches S, then R’ matches T(S),
where R’ is the transform-equivalent version of R for T.

 

For example:

  1. if R matches “Fred”, then the case-insensitive version of R matches “fred”.
  2. if R matches “O¨BB”, then the canonically-equivalent version of R matches “ÖBB”.
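
As a rough illustration of the black-box principle, the sketch below uses Python's standard re and unicodedata modules. It assumes T is str.lower for case and NFC normalization for canonical equivalence. It also assumes that R’ can be approximated by compiling R with re.IGNORECASE in the first case and by normalizing the pattern itself in the second. None of these choices come from L2/11-030, and the normalization trick only works here because the pattern is a plain literal. The helper name matches is hypothetical.

```python
import re
import unicodedata


def matches(pattern, text, flags=0):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text, flags) is not None


# Example 1: case. Assumption: T = str.lower, R' = R with re.IGNORECASE.
R = "Fred"
S = "Fred"
case_premise = matches(R, S)                              # R matches S
case_conclusion = matches(R, S.lower(), re.IGNORECASE)    # R' matches T(S)

# Example 2: canonical equivalence. Assumption: T = NFC normalization,
# R' = the NFC-normalized literal pattern (not a general solution).
R2 = "O\u0308BB"                                  # O + combining diaeresis, B, B
S2 = "O\u0308BB"
T_S2 = unicodedata.normalize("NFC", S2)           # precomposed U+00D6, B, B
R2_prime = unicodedata.normalize("NFC", R2)
canon_premise = matches(R2, S2)                   # R matches S
canon_conclusion = matches(R2_prime, T_S2)        # R' matches T(S)

# Without the transform-equivalent pattern, the original R fails on T(S).
untransformed_fails = not matches(R2, T_S2)
```

In both examples the premise and the conclusion hold, while the untransformed pattern R does not match T(S) in the canonical case.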

 

With a white-box model, some facets of the internal structure are equivalent, and others are not. One would look at each construct and decide on a case-by-case basis that one construct should match insensitively, and another construct should not.

 

Personally, I don’t find this a compelling model: someone might argue that /\p{Lu}/ should not match lowercase letters, since it is explicitly uppercase, but one could just as well argue that /[A-Z]/ should not, since it is explicitly also uppercase, or that /FRED/ should not, for the same reason.
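
For comparison, here is a quick check of how one existing engine treats two of these constructs. Python's re module is used only as an example of current practice, not as a recommendation. It does not support \p{Lu}, so that case is left out. The helper name ci_match is hypothetical.

```python
import re


def ci_match(pattern, text):
    """Full match with the engine's case-insensitive flag (hypothetical helper)."""
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None


# Both the "explicitly uppercase" range and the uppercase literal
# match lowercase input once the case-insensitive flag is set.
range_matches_lower = ci_match(r"[A-Z]", "f")
literal_matches_lower = ci_match(r"FRED", "fred")

# Without the flag, neither construct matches lowercase input.
range_sensitive = re.fullmatch(r"[A-Z]", "f") is None
literal_sensitive = re.fullmatch(r"FRED", "fred") is None
```

This engine applies the flag uniformly to ranges and literals, which is the black-box behavior for those two constructs.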

 

2. How should non-1:1 transforms be handled?

 

When the transform doesn't map 1:1 between code points, but can also map 2:1, 1:2 or other combinations, the matching algorithms get trickier. For example, should /[aA][:Mn:]/ match (under canonical equivalence) the single character “Ä”?
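
To make the question concrete, the sketch below shows the code-point counts involved. Python's re module has no [:Mn:] syntax, so the pattern is imitated by a hand-written check built on unicodedata.category. The function name matches_a_then_mark is hypothetical, and the check is a stand-in for the regex, not an implementation of any proposed behavior.

```python
import unicodedata


def matches_a_then_mark(text):
    """Stand-in for /[aA][:Mn:]/: 'a' or 'A' followed by one nonspacing mark."""
    return (
        len(text) == 2
        and text[0] in "aA"
        and unicodedata.category(text[1]) == "Mn"
    )


precomposed = "\u00C4"                                   # single code point
decomposed = unicodedata.normalize("NFD", precomposed)   # A + U+0308

# 1:2 under canonical decomposition: one code point becomes two.
lengths = (len(precomposed), len(decomposed))

# A code-point-by-code-point matcher accepts only the decomposed form.
literal_on_precomposed = matches_a_then_mark(precomposed)
literal_on_decomposed = matches_a_then_mark(decomposed)

# Case transforms have the same problem: full case folding of U+00DF is 1:2.
sharp_s_folded = "\u00DF".casefold()
```

A matcher that works one code point at a time rejects the precomposed form even though it is canonically equivalent to a string the pattern accepts, which is exactly the case the question asks about.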

 

For this, we can also divide into ‘black box’ and ‘white box’ approaches.