Re: Regex Clarifications

From: Mark Davis

Date: 2011-02-10


Here are some requested clarifications, on the basis of discussions on the i18n-dev email list.


1. RL1.1        Hex Notation

To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF).


It should be made clear that the syntax must allow hex notation for the Unicode code point itself, rather than for the corresponding UTF-16 or UTF-8 code units. Syntax like \x{D800}\x{DC00} or \x{F0}\x{90}\x{80}\x{80} (both of which spell out U+10000 in code units) therefore does not meet this requirement.
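As an illustration only (Python and its re module are not part of this discussion; they are used here as a convenient example), the sketch below shows that the two rejected spellings are just the UTF-16 and UTF-8 code units of the single code point U+10000, and that an engine can instead accept a code point escape. The \UXXXXXXXX escape is Python's own syntax, not something UTS #18 prescribes.

```python
import re

cp = 0x10000                      # first supplementary code point
ch = chr(cp)

# Code units for U+10000. These are what \x{D800}\x{DC00} and
# \x{F0}\x{90}\x{80}\x{80} spell out, one code unit per escape.
utf16_bytes = ch.encode("utf-16-be")
utf16_units = [int.from_bytes(utf16_bytes[i:i + 2], "big") for i in (0, 2)]
utf8_units = list(ch.encode("utf-8"))

# A code point escape in the pattern names U+10000 directly,
# independent of any encoding form.
pattern = re.compile(r"\U00010000")
match = pattern.fullmatch(ch)
```

Here utf16_units and utf8_units hold the values from the two rejected examples, while the pattern refers to the code point without mentioning any encoding.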



2. RL1.7       Supplementary Code Points


      To meet this requirement, an implementation shall handle the full
      range of Unicode code points, including values from U+FFFF to
      U+10FFFF. In particular, where UTF-16 is used, a sequence
      consisting of a leading surrogate followed by a trailing surrogate
      shall be handled as a single code point in matching.


Add a note that it is permissible but not required to match an isolated surrogate code point (such as \x{D800}) in text that supports it (Unicode 16-bit strings and Unicode 32-bit characters).
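A minimal sketch of both behaviors, again using Python's re module purely as an example engine (an assumption of this illustration, not something named in the discussion): a supplementary character is matched as one code point, and an isolated surrogate happens to be matchable because Python strings can contain one.

```python
import re

supplementary = "\U00010000"      # one code point, two UTF-16 code units

# Matching is by code point: "." consumes the whole character.
code_point_count = len(supplementary)
dot_match = re.fullmatch(".", supplementary)

# Python strings may hold an isolated surrogate, and this engine
# allows matching it (permissible under the note, not required).
text = "a" + "\ud800" + "b"
lone_match = re.search(r"\ud800", text)
lone_span = lone_match.span() if lone_match else None
```

An engine that rejected the second pattern, or never matched it, would still satisfy RL1.7 under the proposed note.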



3. Conformance clause 0


C0. An implementation claiming conformance to this
    specification at any Level shall identify the version of
    this specification and the version of the Unicode Standard.


It is unclear whether we want to require identifying the specific version of Unicode.



4. RL1.2       Properties


  To meet this requirement, an implementation shall provide at
  least a minimal list of properties, consisting of the following:


Make it even clearer that, to meet this requirement, the implementation has to satisfy the Unicode definitions of these properties, not some other definitions. However, the names used for the properties might need to differ for compatibility. For example, if a regex engine already has a property named “Alphabetic” with different semantics, it may need to expose the Unicode property under a different name, such as “Unicode_Alphabetic”.
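To illustrate how a pre-existing notion of "alphabetic" can differ from the Unicode Alphabetic property, here is a small sketch using Python's str.isalpha(), which is documented as being based on the Letter general categories rather than on Alphabetic. The choice of Python and of the two sample characters is an assumption of this example, not part of the proposal.

```python
import unicodedata

# Characters that have the Unicode Alphabetic property without being in a
# Letter category (L*): U+0345 via Other_Alphabetic, U+2160 via Nl.
samples = ["\u0345", "\u2160"]

# Map each sample to (General_Category, result of the built-in alpha test).
report = {
    f"U+{ord(c):04X}": (unicodedata.category(c), c.isalpha())
    for c in samples
}
```

Both characters are Alphabetic per Unicode, yet the built-in test reports them as non-alphabetic. An engine in this position would need a separately named property to meet RL1.2 without breaking existing behavior.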



5. In addition, I think we should add a new Level 2 condition.


Add this to the proposed update of UTS #18:


RL2.7       Full Property Support


To meet this requirement, an implementation shall provide all Unicode properties listed below.


This list will be populated by including the properties in Table 7. Property Index by Scope of Use (http://www.unicode.org/reports/tr44/#Property_Index), with the following exceptions:

  1. properties that are neither informative nor normative, such as contributory properties or provisional properties
  2. properties that are either obsolete or deprecated
  3. Unicode_1_Name
  4. Unicode_Radical_Stroke


Ed Note: Feedback is requested on whether exceptions should be added.