L2/18-267R

Title: Proposed Fix for Character Name Restrictions on Use of Hyphen-Minus

Source: Ken Whistler

Date: August 23, 2018

Status: For consideration by UTC


Background

The UTC received feedback from David Corbett indicating that the current
character name syntax restrictions on the use of U+002D HYPHEN-MINUS
do not fully support the claim that identifiers derived from character
names are unique. (See L2/18-231, Sat Jun 23 11:18:05.)

In particular, if a hypothetical character name "BLAH- -X" and another
hypothetical character name "BLAH - X" were transformed by substituting
an underscore for each hyphen-minus *and* for each space, then both
hypothetical character names would fold to identical strings "BLAH___X",
and hence would no longer be distinguished as identifiers.
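
The collision can be reproduced with a minimal Python sketch. The
function name to_identifier is hypothetical, chosen here only for
illustration; it implements nothing more than the substitution
described above, and both input names are hypothetical as well.

```python
def to_identifier(name: str) -> str:
    """Replace each hyphen-minus and each space with a single underscore."""
    return name.replace("-", "_").replace(" ", "_")


# Hypothetical names; neither exists in the UCD.
first = to_identifier("BLAH- -X")
second = to_identifier("BLAH - X")

print(first)            # BLAH___X
print(second)           # BLAH___X
print(first == second)  # True
```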

The problem arises from the fact that the character name syntax specified
in Section 4.8 of the Unicode Standard allows for an initial hyphen-minus
for subparts of character names, as it must to cover a few existing character
names such as:

U+0F60 TIBETAN LETTER -A

It also allows for a trailing hyphen-minus for subparts of character
names, as it must to cover a few existing character names such as:

U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN

However, those rules do not explicitly disallow a hyphen-minus with
a SPACE on both sides.

The UCD does not in fact contain any character with a hyphen-minus
preceded and followed by a SPACE, nor is it likely that the UTC would ever decide
to standardize a character with such a name. But formally, the rules for
character name restrictions currently do not disallow doing so.

That in turn means there is a logical hole in the following paragraph
about Names as Identifiers (p. 180, TUS 11.0):

"... a common strategy is to replace any hyphen-minus or space in a
character name by a single "_" when constructing a formal identifier
from a character name. This strategy automatically results in a
syntactically correct identifier in most formal languages. Furthermore,
such identifiers are guaranteed to be unique, because of the special
rules for character name matching."

In fact such identifiers *are* unique, given the current set of
character names, but they are not *guaranteed* to be unique by
the syntactic constraints on character names per se. Uniqueness
currently depends on the committees simply not standardizing particular
character names (such as a "BLAH- -X" versus "BLAH - X" pair) which
would fall together when underscores are substituted for all hyphen-minus
and space characters. Note that character *name* uniqueness would
still prevent the committees from standardizing such a name pair,
but that restriction is not the same as a syntactic guarantee for
the identifier substitution rule.

Proposal

The fix is pretty simple. Rule R3 should be extended to disallow
isolated medial hyphen-minus characters in Unicode character names.
The text currently reads:

R3 U+002D HYPHEN-MINUS does not occur as the first or last character
of a character name, nor immediately preceding or following another
hyphen-minus character. (In other words, multiple occurrences
of U+002D in sequence are not allowed.)

The proposed revision would read:

R3 U+002D HYPHEN-MINUS does not occur as the first or last character
of a character name, >>nor immediately between two spaces,<<
nor immediately preceding or following another
hyphen-minus character. (In other words, multiple occurrences
of U+002D in sequence are not allowed.)

That revision would automatically rule out names of the "BLAH - X"
variety and, together with the other restrictions already specified,
would give the underscore substitution strategy a guarantee of
identifier uniqueness.
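
As an illustration only, the current and revised wording of R3 can be
sketched as two Python predicates. The function names are hypothetical,
and the sketch checks R3 alone; it assumes the other character name
rules are validated elsewhere.

```python
def satisfies_current_r3(name: str) -> bool:
    """Current R3: no leading or trailing hyphen-minus, and no two in sequence."""
    return not (name.startswith("-") or name.endswith("-") or "--" in name)


def satisfies_revised_r3(name: str) -> bool:
    """Revised R3: additionally, no hyphen-minus immediately between two spaces."""
    return satisfies_current_r3(name) and " - " not in name


# The last two names are hypothetical; neither exists in the UCD.
for name in (
    "TIBETAN LETTER -A",
    "TIBETAN MARK BSKA- SHOG GI MGO RGYAN",
    "BLAH- -X",
    "BLAH - X",
):
    print(name, satisfies_current_r3(name), satisfies_revised_r3(name))
```

Under this sketch, the two existing Tibetan names pass both checks,
while "BLAH - X" passes the current rule and fails only the revised
one, so the hypothetical colliding pair could no longer coexist.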

If the UTC agrees to this change, it should also request a
ballot comment on the CD for the 6th Edition of 10646, to make
sure that the corresponding character name restrictions specified
in 10646 also disallow isolated medial hyphen-minus characters in 
character names.

I would also recommend adding a short note to UAX #44, Section
5.9.2, Matching Character Names, to clarify the relationship between
the formal name matching rule (UAX44-LM2), the restrictions on
the use of spaces and hyphen-minuses in character names, and
the use of the underscore-replacement strategy for creating
syntactically valid (and unique) formal identifiers.