Regex Clarifications

L2/11-072R1

Re: Regex Clarifications

From: Mark Davis

Date: 2011-02-10

Here are some requested clarifications, on the basis of discussions on the i18n-dev email list.

1. RL1.1 Hex Notation

To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF).

It should be made clear that the syntax must allow hex notation for the Unicode code point itself, rather than for the corresponding UTF-16 or UTF-8 code units, so syntax like \x{D800}\x{DC00} or \x{F0}\x{90}\x{80}\x{80} does not meet this requirement.
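As an illustration of the distinction, the following minimal sketch uses Python, chosen only for convenience. The choice of U+10000 as the example character is an assumption based on the code unit sequences quoted above. The sketch shows that one code point corresponds to two UTF-16 code units or four UTF-8 code units, and that a code point notation names the character with a single escape. Python spells that escape \U00010000 rather than \x{10000}; the exact syntax is not the point here.

```python
import re

# U+10000 is used as the example code point (an assumption for illustration).
cp = 0x10000
ch = chr(cp)

# UTF-16 code units for the character (big-endian, no byte order mark).
utf16 = ch.encode("utf-16-be")
utf16_units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# UTF-8 code units (bytes) for the same character.
utf8_units = list(ch.encode("utf-8"))

print([f"{u:04X}" for u in utf16_units])
print([f"{b:02X}" for b in utf8_units])

# Code point notation: a single escape in the pattern names the whole code point.
by_code_point = re.fullmatch(r"\U00010000", ch) is not None
print(by_code_point)
```

A syntax that forces the pattern author to write out either of the two code unit sequences ties the pattern to one encoding form, which is what the clarification is meant to rule out.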

2. RL1.7 Supplementary Code Points

To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.

Add a note that it is permissible, but not required, to match an isolated surrogate code point (such as \x{D800}) in text that supports it (Unicode 16-bit strings and Unicode 32-bit characters).
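The intended matching behavior can be sketched in Python, again purely as an illustration. Python strings are sequences of code points rather than UTF-16 code units, so this shows the code point semantics the requirement asks for, not surrogate pair handling inside a UTF-16 engine. The variable names are hypothetical.

```python
import re

# A supplementary character is one code point, so "." consumes it whole.
supplementary = "\U00010000"
dot_matches_whole = re.fullmatch(r".", supplementary) is not None

# A character class can range over the supplementary code points directly.
in_supplementary_range = (
    re.fullmatch(r"[\U00010000-\U0010FFFF]", supplementary) is not None
)

# Python strings may hold an isolated surrogate, and the engine happens to
# match it. This is the permitted-but-not-required case from the note.
isolated = "\ud800"
isolated_matches = re.fullmatch(r"\ud800", isolated) is not None

print(dot_matches_whole, in_supplementary_range, isolated_matches)
```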

3. Conformance clause 0

C0. An implementation claiming conformance to this specification at any Level shall identify the version of this specification and the version of the Unicode Standard.

It is unclear whether we want to require identifying the specific version of Unicode.

4. RL1.2 Properties

To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following:

Make it even clearer that, in order to meet this requirement, the implementation has to satisfy the Unicode definition of these properties, not some other definition. However, the names used for the properties might need to be different for compatibility. For example, if a regex engine already has “Alphabetic” with a different meaning, it may need to expose the Unicode property under a different name, such as “Unicode_Alphabetic”.

5. In addition, I think we should add a new Level 2 condition.

Add this to the proposed update of UTS #18:

RL2.7 Full Property Support

To meet this requirement, an implementation shall provide all Unicode properties listed below.

This list will be populated by including the properties in Table 7. Property Index by Scope of Use (http://www.unicode.org/reports/tr44/#Property_Index), with the following exceptions:

properties that are neither informative nor normative, such as contributory properties or provisional properties
properties that are either obsolete or deprecated
Unicode_1_Name
Unicode_Radical_Stroke

Ed Note: Feedback is requested on whether exceptions should be added.