L2/10-091

Title:  Textual Fixes Needed for UTS #18, Regex

Author: Ken Whistler

Date:   March 18, 2010

Action: For consideration by the UTC


Recent discussion on the Unicode list of the details of loose
matching rules turned up some infelicities in the wording of
UTS #18 that may be leading to confusion about implementation
of Unicode Regex.

I suggest that the UTC review the issues related to loose
matching in the document and decide whether to issue a
proposed update of UTS #18 with textual fixes for them.
There is also a separate issue related to references for
casing.

1. In Section 1.2 Properties, 3rd paragraph, the current UTS #18
has the following text:

  "There are both abbreviated names and longer, more descriptive
  names. It is strongly recommended that both names be
  recognized, and that loose matching of property names be
  used, whereby the case distinctions, whitespace, hyphens,
  and underbar are ignored."
  
This text is not wrong, but is a bit antiquated. The suggestion
is that it should be updated to more precisely refer to
property aliases and property value aliases, rather than
just "names", and should make reference to UAX44-LM3 for
loose matching, rather than doing a pocket definition of
loose matching here. In general this text should be aligned
more closely with current wording in UAX #44.
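
As a rough illustration of the loose matching described in the
quoted text, the following Python sketch folds a property alias
to a comparison key by ignoring case, whitespace, hyphens, and
underbar. The function name loose_alias_key is hypothetical, and
the sketch covers only the four distinctions named in the quoted
paragraph. UAX44-LM3 itself remains the authoritative statement
of the rule and should be consulted for any further provisions.

```python
import re

# Whitespace, hyphens, and underbar are ignored, per the quoted UTS #18 text.
_IGNORED = re.compile(r"[\s_-]+")


def loose_alias_key(alias: str) -> str:
    """Fold a property alias or property value alias to a comparison key."""
    return _IGNORED.sub("", alias).lower()


# Aliases that differ only in ignorable ways fold to the same key.
assert loose_alias_key("General_Category") == loose_alias_key("general category")
assert loose_alias_key("Line-Break") == loose_alias_key("linebreak")
# Distinct aliases stay distinct.
assert loose_alias_key("gc") != loose_alias_key("sc")
```

An implementation would typically fold both the user-supplied
string and each known alias with the same function and compare
the resulting keys.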


2. In Section 2.5 Name Properties, subsection "Individually
Named Characters", 3rd and 4th paragraphs, there is a
similar pocket definition of loose matching for character
names, which if anything is a little more ambiguous and
problematical.

   "As with other property values, names should use a loose
   match, disregarding case, spaces and hyphen (the underbar
   character "_" cannot occur in Unicode character names).
   An implementation may also choose to allow namespaces,
   where some prefix like "LATIN LETTER" is set globally
   and used if there is no match otherwise.
   
   "There are, however, three instances that require
   special-casing with loose matching, where an extra
   test shall be made for the presence or absence of
   a hyphen."
   
The introductory phrase, "As with other property values,"
is misleading, because the loose matching for character
names is not identical to the loose matching for
symbolic property aliases and property value aliases.

The suggestion is that this text should be updated to spell
out loose matching for character names by reference to
UAX44-LM2. When doing so, the 3 exceptions will then be
part of the rule, rather than being stated as separate
normative requirements for Regex. This will handle more
gracefully the possibility of any future addition of
characters involving a contrast based on presence of a
hyphen. The 3 exceptions can then be listed here
informatively, rather than as the target of a "shall"
requirement.
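
To illustrate how the exceptions can fall out of the rule rather
than being tested separately, here is a minimal Python sketch of
character name matching along the lines of UAX44-LM2, as that
rule is understood here: ignore case, whitespace, underscore,
and medial hyphens, except for the hyphen in U+1180 HANGUL
JUNGSEONG O-E. The function name loose_name_key and the reading
of "medial" as "a hyphen with a letter or digit immediately on
both sides" are assumptions of this sketch, not text taken from
either document.

```python
import re

# Assumption: a "medial" hyphen has a letter or digit directly on both sides.
_MEDIAL_HYPHEN = re.compile(r"(?<=[0-9a-z])-(?=[0-9a-z])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")


def loose_name_key(name: str) -> str:
    """Fold a character name to a comparison key (sketch of UAX44-LM2)."""
    lowered = name.lower()
    # Drop medial hyphens first, while spaces still mark word boundaries.
    key = _SPACE_OR_UNDERSCORE.sub("", _MEDIAL_HYPHEN.sub("", lowered))
    # Explicit exception: keep the hyphen of U+1180 HANGUL JUNGSEONG O-E
    # so that it does not collide with U+116C HANGUL JUNGSEONG OE.
    compact = _SPACE_OR_UNDERSCORE.sub("", lowered)
    if key == "hanguljungseongoe" and compact.endswith("o-e"):
        return "hanguljungseongo-e"
    return key


# Ordinary names match loosely.
assert loose_name_key("zero-width space") == loose_name_key("ZERO WIDTH SPACE")
# The hyphen in TIBETAN LETTER -A follows a space, so it is not medial.
assert loose_name_key("TIBETAN LETTER A") != loose_name_key("TIBETAN LETTER -A")
# The Hangul pair is kept apart by the explicit exception.
assert loose_name_key("HANGUL JUNGSEONG OE") != loose_name_key("HANGUL JUNGSEONG O-E")
```

Under this reading, the two Tibetan pairs (TIBETAN LETTER A and
TIBETAN LETTER -A, and their SUBJOINED counterparts) need no
special test at all, and only the Hangul pair requires an
explicit exception in the rule.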


3. In general, in Section 2.5 and throughout the document,
it would be advisable to make a pass over requirements
that are currently phrased in terms of "should", replacing
them with phrases using "shall" wherever the clear intent
is to impose a normative requirement.


4. In Section 2.4 Default Loose Matches there is an
anomalous reference to the superseded UAX #21, Case Mappings.
This erroneously points people to a very outdated document,
and is unfortunately propagating that reference into
secondary material about Unicode Regex.
This reference should be updated to current section
references in the latest version of the Unicode Standard.
This has been partially fixed in the references section
of UTS #18, but needs to be corrected in Section 2.4 as well.