L2/18-267R

Title: Proposed Fix for Character Name Restrictions on Use of Hyphen-Minus
Source: Ken Whistler
Date: August 23, 2018
Status: For consideration by UTC

Background

The UTC received feedback from David Corbett indicating that the current character name syntax restrictions on the use of U+002D HYPHEN-MINUS do not completely match the claims made about character name uniqueness for identifiers. (See L2/18-231, Sat Jun 23 11:18:05.)

In particular, if a hypothetical character name "BLAH- -X" and another hypothetical character name "BLAH - X" were transformed by substituting an underscore for each hyphen-minus *and* for each space, then both hypothetical character names would fold to the identical string "BLAH___X", and hence would no longer be distinguished as identifiers.

The problem arises from the fact that the character name syntax specified in Section 4.8 of the Unicode Standard allows for an initial hyphen-minus for subparts of character names, as it must to cover a few existing character names such as:

  U+0F60 TIBETAN LETTER -A

It also allows for a trailing hyphen-minus for subparts of character names, as it must to cover a few existing character names such as:

  U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN

However, those rules do not explicitly disallow a hyphen-minus with a SPACE on both sides. The UCD does not in fact contain any character with a hyphen-minus preceded and followed by a SPACE, nor is it likely that the UTC would ever decide to standardize a character with such a name. But formally, the rules for character name restrictions currently do not disallow doing so.

That in turn means there is a logical hole in the following paragraph about Names as Identifiers (p. 180, TUS 11.0):

"... a common strategy is to replace any hyphen-minus or space in a character name by a single "_" when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages.
Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching."

In fact such identifiers *are* unique, given the current set of character names, but they are not *guaranteed* to be unique by the syntactic constraints on character names per se. The guarantee is simply a matter of committees not standardizing particular character names (such as a "BLAH- -X" versus "BLAH - X" pair) which would fall together if just substituting underscores for all hyphen-minus and space characters. Note that character *name* uniqueness would still prevent the committees from standardizing such a name pair, but that restriction is not the same as a syntactic guarantee for the identifier substitution rule.

Proposal

The fix is pretty simple. Rule R3 should be extended to disallow isolated medial hyphen-minus characters in Unicode character names.

The text currently reads:

  R3 U+002D HYPHEN-MINUS does not occur as the first or last character of a character name, nor immediately preceding or following another hyphen-minus character. (In other words, multiple occurrences of U+002D in sequence are not allowed.)

The proposed revision would read:

  R3 U+002D HYPHEN-MINUS does not occur as the first or last character of a character name, >>nor immediately between two spaces,<< nor immediately preceding or following another hyphen-minus character. (In other words, multiple occurrences of U+002D in sequence are not allowed.)

That revision would automatically rule out names of the "BLAH - X" variety, and with the other restrictions already specified, would result in the underscore substitution strategy having a guarantee for identifier uniqueness.

If the UTC agrees to this change, it should also request a ballot comment on the CD for the 6th Edition of 10646, to make sure that the corresponding character name restrictions specified in 10646 also disallow isolated medial hyphen-minus characters in character names.
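
The effect of the proposed wording can be illustrated with a minimal Python sketch. The function names `violates_r3` and `to_identifier` are hypothetical and exist only for this illustration; the checks cover only the R3 text as quoted in this document, not the full set of character name restrictions.

```python
def violates_r3(name: str, revised: bool = True) -> bool:
    """Return True if `name` breaks rule R3 as quoted in this document.

    With revised=False only the current wording is checked; with
    revised=True the proposed "between two spaces" clause is added.
    """
    # Current wording: no hyphen-minus as first or last character.
    if name.startswith("-") or name.endswith("-"):
        return True
    # Current wording: no two hyphen-minus characters in sequence.
    if "--" in name:
        return True
    # Proposed addition: no hyphen-minus immediately between two spaces.
    if revised and " - " in name:
        return True
    return False


def to_identifier(name: str) -> str:
    """Replace each hyphen-minus and each space by a single underscore."""
    return name.replace("-", "_").replace(" ", "_")


if __name__ == "__main__":
    for candidate in ("BLAH- -X", "BLAH - X", "TIBETAN LETTER -A"):
        print(
            repr(candidate),
            to_identifier(candidate),
            violates_r3(candidate, revised=False),
            violates_r3(candidate, revised=True),
        )
```

Both hypothetical names fold to the same identifier, but only "BLAH - X" is rejected once the proposed clause is included, so the pair can no longer coexist under the syntax rules alone. The existing Tibetan names pass under either wording.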
I would also recommend adding a short note to the UAX #44 section 5.9.2 Matching Character Names to clarify the relationship between the formal name matching rule (UAX44-LM2), the restrictions on the use of spaces and hyphen-minuses in character names, and the use of the underscore-replacement strategy for creating syntactically valid (and unique) formal identifiers.
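
The relationship among those three pieces can be checked by brute force on a small alphabet. The following Python sketch rests on stated assumptions and is not normative text. All function names are hypothetical. `is_valid` applies R3 as quoted in this document plus the assumption that a space cannot be first, last, or doubled. `loose_key` is a simplified reading of UAX44-LM2 that drops spaces and any hyphen-minus with a non-space character on both sides, and omits case folding, underscores, and any special-cased names. The search looks for two syntactically valid names that fold to the same identifier while remaining distinct under the loose key.

```python
from itertools import product


def is_valid(name: str, revised: bool) -> bool:
    """Syntax check under the assumptions stated in the lead-in."""
    # No space or hyphen-minus as the first or last character.
    if not name or name[0] in " -" or name[-1] in " -":
        return False
    # No doubled hyphen-minus (R3) and no doubled space (assumed).
    if "--" in name or "  " in name:
        return False
    # Proposed R3 addition: no hyphen-minus between two spaces.
    if revised and " - " in name:
        return False
    return True


def fold(name: str) -> str:
    """Underscore-replacement strategy for building an identifier."""
    return name.replace("-", "_").replace(" ", "_")


def loose_key(name: str) -> str:
    """Simplified loose-matching key: drop spaces and medial hyphens."""
    out = []
    for i, ch in enumerate(name):
        if ch == " ":
            continue
        medial = (
            ch == "-"
            and 0 < i < len(name) - 1
            and name[i - 1] != " "
            and name[i + 1] != " "
        )
        if medial:
            continue
        out.append(ch)
    return "".join(out)


def find_collision(max_len: int, revised: bool):
    """Return two valid names with equal identifiers but different
    loose keys, or None if no such pair exists up to max_len."""
    seen = {}  # identifier -> (loose key, first name seen)
    for n in range(1, max_len + 1):
        for chars in product("AB -", repeat=n):
            name = "".join(chars)
            if not is_valid(name, revised):
                continue
            key = loose_key(name)
            first_key, first_name = seen.setdefault(fold(name), (key, name))
            if first_key != key:
                return first_name, name
    return None


if __name__ == "__main__":
    print("current R3: ", find_collision(5, revised=False))
    print("proposed R3:", find_collision(7, revised=True))
```

Under these assumptions, the search with the current wording surfaces a pair of the "A - A" versus "A- -A" kind, while the search with the proposed wording finds none within the tested lengths. A bounded search is only a sanity check and does not replace the syntactic argument.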