L2/10-091

Title: Textual Fixes Needed for UTS #18, Regex
Author: Ken Whistler
Date: March 18, 2010
Action: For consideration by the UTC

Recent discussion of the details of loose matching rules on the unicode list turned up some infelicities in the wording of UTS #18 that may be leading to confusion about implementation of Unicode Regex. I suggest that the UTC review the issues related to loose matching in the document and decide whether to issue a proposed update of UTS #18 with textual fixes for them. There is also a separate issue related to references for casing.

1. In Section 1.2 Properties, 3rd paragraph, the current UTS #18 has the following text:

"There are both abbreviated names and longer, more descriptive names. It is strongly recommended that both names be recognized, and that loose matching of property names be used, whereby the case distinctions, whitespace, hyphens, and underbar are ignored."

This text is not wrong, but is a bit antiquated. The suggestion is that it should be updated to refer more precisely to property aliases and property value aliases, rather than just "names", and should make reference to UAX44-LM3 for loose matching, rather than giving a pocket definition of loose matching here. In general this text should be aligned more closely with current wording in UAX #44.

2. In Section 2.5 Name Properties, subsection "Individually Named Characters", 3rd and 4th paragraphs, there is a similar pocket definition of loose matching for character names, which if anything is a little more ambiguous and problematic.

"As with other property values, names should use a loose match, disregarding case, spaces and hyphen (the underbar character "_" cannot occur in Unicode character names). An implementation may also choose to allow namespaces, where some prefix like "LATIN LETTER" is set globally and used if there is no match otherwise.

"There are, however, three instances that require special-casing with loose matching, where an extra test shall be made for the presence or absence of a hyphen." The introductory phrase, "As with other property values," is misleading, because the loose matching for character names is not identical to the loose matching for symbolic property aliases and property value aliases. The suggestion is that this text should be updated to spell out loose matching for character names by reference to UAX44-LM2. When doing so, the 3 exceptions will then be part of the rule, and rather than being stated as separate normative requirements for Regex. This will handle more gracefully the possibility of any future addition of characters involving a contrast based on presence of a hyphen. The 3 exceptions can then be listed here informatively, rather than as the target of a "shall" requirement. 3. In general, in Section 2.5 and throughout the document, it would be advisable to make a pass to eliminate requirements that are currently phrased in terms of "should", to be replaced by phrases using "shall", where the clear intent is to impose a normative requirement. 4. In Section 2.4 Default Loose Matches there is an anomalous reference to the superseded UAX #21, Case Mappings. This is erroneously pointing people to a very outdated (and superseded) document, and is unfortunately propagating those references into secondary material about Unicode Regex. This reference should be updated to current section references in the latest version of the Unicode Standard. This has been partially fixed in the references section of UTS #18, but needs to be corrected in Section 2.4 as well.