L2/11-070R3

 

Re: Draft PRI Text on Unicode Regular Expression Guidelines
Source: Mark Davis
Date: 2011/02/10

 

The Unicode consortium is considering changes to the Unicode Regular Expression Guidelines (http://unicode.org/reports/tr18/) that have arisen in connection with case-insensitive and canonical-equivalent matching, and is soliciting feedback on these changes.

 

Background

 

Part of the issue is defining more precisely the connection between matching of regular expressions and equivalence relations among strings. More formally:

 

Matching under Equivalence Relations. A regular expression R matches according to an equivalence relation E whenever for all strings S and T, if S is equivalent to T under E, then R matches S if and only if R matches T.

 

For case-insensitivity, the equivalence relation is established in Unicode according to whether two strings case-fold to the same value. The case folding can either be simple (a 1:1 mapping of code points) or full (which has 1:N mappings). For canonical equivalence, the equivalence relation is established by whether two strings have the same NFD format, which has N:M mappings.

 

Examples:

 

  1. “ABC” and “Abc” are equivalent under both full and simple case folding.
  2. “office” (with the “ff” ligature) and “OFFICE” are equivalent under full case folding, but not under simple case folding.
  3. <o-horn, dotbelow> and <o-dotbelow, horn> are canonically equivalent, since they both have the same NFD form: <o, dotbelow, horn>

 

1. Full vs Simple case-insensitive matching

 

The consortium is planning to withdraw the recommendation for doing full case-insensitive matching in RL2.4 Default Loose Matches. The text would be modified to focus this section on conversion.

 

The text currently reads:

 

RL2.4        Default Loose Matches

To meet this requirement:

  1. if an implementation provides for case-insensitive matching, then it shall provide at least the full, default Unicode case-insensitive matching.
  2. if an implementation provides for case conversions, then it shall provide at least the full, default Unicode case conversion.

 

The new text would read:

 

RL2.4        Default Case Conversion

To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the full, default Unicode case conversion.

 

There are two reasons for removing full case-insensitive matching:

 

  1. It is unclear how full case-insensitive matching can be effectively implemented in regular expressions, especially with back references.
  2. There are a number of examples where the results would be counter-intuitive for typical users of regular expressions.

 

What is feasible is to describe how to transform text into the fully-case-folded form, and construct regular expressions targeted at such text. So the text of UTS#10 focus on such guidelines and not requirements.

 

Note: the text “To correctly implement a caseless match and case conversions, see UAX #21: Case Mappings [Case].” would also be updated to become a current reference.

 

2. Canonical-equivalent matching

 

The consortium is planning to withdraw the recommendation for doing full canonical-equivalence matching in RL2.1 Canonical Equivalents. The current text has:

 

RL2.1 Canonical Equivalents

To meet this requirement, an implementation shall provide a mechanism for ensuring that all canonically equivalent literal characters match.

 

However, the way most regular expression engines work, this requirement cannot be satisfied. The problem is that canonical equivalence may involve reordering of characters. For example, each of the following are equivalent:

 

  1. o + horn + dotbelow
  1. U+006F ( o ) LATIN SMALL LETTER O +
  2. U+031B ( ̛ ) COMBINING HORN +
  3. U+0323 ( ̣ ) COMBINING DOT BELOW
  1. o + dotbelow + horn
  1. U+006F ( o ) LATIN SMALL LETTER O +
  2. U+0323 ( ̣ ) COMBINING DOT BELOW +
  3. U+031B ( ̛ ) COMBINING HORN
  1. o-horn + dotbelow
  1. U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN
  2. U+0323 ( ̣ ) COMBINING DOT BELOW
  1. o-dotbelow + horn
  1. U+1ECD ( ọ ) LATIN SMALL LETTER O WITH DOT BELOW +
  2. U+031B ( ̛ ) COMBINING HORN
  1. o-horn-dotbelow
  1. U+1EE3 ( ợ ) LATIN SMALL LETTER O WITH HORN AND DOT BELOW

 

The regular expression pattern /o\x{31B}/ matches the first two characters of #1, the first and third characters of #2, the first character of #3, part of the first character together with the third character of #4, and part of the character of #5. Some of these issues are brought out in the text of UTS#18, but implementing a strict reading of RL2.1 is infeasible, because in practice regex APIs are not set up to match parts of characters or handle discontiguous selections.

 

There are many other edge cases: A combining mark may come from some part of the pattern far removed from where the base character was, or may not explicitly be in the pattern at all. It is unclear what a dot should match, or how back references should work.

 

What is feasible is to construct patterns that will match against NFD (or NFKD) text, and the text of UTS#18 should reflect that. That is, it should describe a process whereby

 

  1. The text being matched is put into into a defined normalization form (NFD or NFKD).
  2. The pattern is not modified in any way from what the user provides.
  3. Matching proceeds on a code-point by code-point basis, as usual.

 

Note that the author of the pattern must know the normalization form of the text, and write the pattern accordingly.

 

3. Case-insensitive matching with properties

 

The consortium is planning to describe more precisely how to match case-insensitively.

 

In particular, a regular expression pattern P can be made to match insensitively by making the following changes in the interpretation of P:

 

  1. Each string is matched case-insensitively. That is, it is logically expanded into a sequence of OR expressions, where each OR expression lists all of the characters that have a simple case-folding to the same value.
  1. For example, /Dåb/ matches as if it were expanded into /(?:d|D)(?:å|Å|Å)(?:b|B)/
  2. This includes back references, such as /(?i)(a.c)\1/, where \1 matches what is in the first grouping.
  1. Each character class is closed under case. That is, it is logically expanded into a set of code points, and then closed by adding all simple case equivalents of each of those code points.
  1. For example, [\p{Block=Phonetic_Extensions} [A-E]] is a character class that matches 133 code points (under Unicode 6.0). Its case-closure adds 7 more code points: a-e, Ᵽ, and Ᵹ, for a total of 140 code points.

 

For both property character classes and explicit character classes, closing under simple case-insensitivity means including characters not in the set.

 

  1. The case-closure of \p{Block=Phonetic_Extensions} includes two characters not in that set, namely Ᵽ and Ᵹ.
  2. The case-closure of [A-E] includes five characters not in that set, namely [a-e].

 

There have been suggestions to restrict case insensitive regex matching so that it would not apply to some or all property-based character classes. One suggestion, for example, is to close all of the POSIX properties under case, but not others. That would require some narrower notion of matching under an equivalence than that presented in Matching under Equivalence Relations above in the Background section. For example, under Matching under Equivalence Relations, the following is true:

 

/(?i)[[\x{80}-\x{FF}]-[:Block=Latin_1_Supplement:]]/  = /[]/

 

(note that Latin_1_Supplement block is identical to U+0080..U+00FF)

 

Under a system whereby the block property was not case folded, the following would be true:

 

/(?i)[[\x{80}-\x{FF}]-[:Block=Latin_1_Supplement:]]/ = /[Å Ÿ]/

 

If the block property were not case folded, it would also mean that an implementation cannot fully resolve a character class containing properties, and then apply case-closure; instead, it must apply case-closure selectively as the character class is interpreted.