L2/13-142

Title: Inconsistencies in Name Uniqueness Criteria
Author: Ken Whistler
Date: July 16, 2013
Status: For consideration by the UTC

Summary

There are several inconsistencies in the statements about the criteria for name uniqueness, both in Unicode 6.2 and in ISO/IEC 10646 3rd Edition. I think it would be best to try to address and resolve these inconsistencies in the context of the preparation of Unicode 6.3, and in ballot comments on the text of the DIS for ISO/IEC 10646 4th Edition, so that the statements and criteria would both be internally consistent in the standards and be synchronized between the two standards.

This contribution provides background information, analysis, and a presentation about alternative approaches the UTC might choose to take to the problem.

Background

The problem turned up during a recent check on name uniqueness, which uncovered the fact that two (formal) character name aliases are not unique under one interpretation of the name uniqueness criteria. The two character name aliases in question are:

CANCEL, character name alias for U+0018
CANCEL CHARACTER, character name alias for U+0094

These two aliases were added to NameAliases.txt in Unicode 6.1, so they have been published in the Unicode Standard since January 31, 2012.

The problem is that on one interpretation of the name uniqueness criteria, it should not be allowable to have character name aliases which differ only by the presence or absence of the string "CHARACTER" in the names. If that interpretation is correct, then the UTC made a mistake in Unicode 6.1, and some kind of correction is in order, perhaps to grandfather in the mistake in the statement of the rules. If that interpretation is *not* correct, then the relevant text in the standard should probably be adjusted to ensure that it is internally consistent and does not lead people to an incorrect interpretation of the criteria.
The issue derives from the fact that character names, (formal) character name aliases, and the names of named Unicode character sequences are all designated as sharing the same namespace. However, there is not, and never has been, any formal syntactic definition of that namespace. Instead, we have gotten by with various statements of "loose matching rules" for determining whether two names match (or not), and have assumed that those statements are internally consistent and also result in a well-defined namespace. Unfortunately, however, we in fact have longstanding inconsistent statements about matching rules, and those inconsistent statements have also been inconsistently extended in applicability as the namespace for character names was extended, first to include the names for named Unicode character sequences and then the (formal) character name aliases. Finally, the attempts to keep 10646 in synch with the Unicode Standard in this regard have themselves also been inconsistent.

The result is somewhat of a textual and conformance mess, with different experts asserting different claims about what names match and what names do not, as well as a whiff of incipient fingerpointing regarding the responsibility for the "mistake" noted above for CANCEL and CANCEL CHARACTER.

Analysis

The full textual details of what is currently published in the two standards regarding this question are provided below in the Appendix section. This should make it much easier to compare text regarding the current state of affairs, without people having to fumble around online comparing 4 different source pages for the Unicode Standard (and policies) and the more difficult-to-locate current text of 10646. I also have included a related excerpt from UTS #18, Unicode Regular Expressions, which has a bearing on this issue. [The loose matching rules in UTS #35 are not relevant to the name matching issue discussed here.]
Boiling it down, there are essentially two versions of the "loose" matching rules in the specifications:

LMR-A: Ignore (i.e., fold away) any casing distinctions, spaces, and any medial hyphen-minus characters in names. Compare the resulting strings. If the folded strings are binary equal, then the names match. There is one grandfathered exception: U+1180 HANGUL JUNGSEONG O-E does not match U+116C HANGUL JUNGSEONG OE.

LMR-B: Identical to LMR-A, except that one also ignores (i.e., folds away) any substring "CHARACTER", "LETTER", or "DIGIT".

As the text stands currently, we have the following situation.

The Unicode Standard, Version 6.2
--Asserts that character names, character name aliases, and the names of Unicode named character sequences share the same namespace.
--Asserts that LMR-A applies to character names and character name aliases.
--Asserts (inconsistently) that either LMR-A or LMR-B applies to the names of Unicode named character sequences.

ISO/IEC 10646 3rd Edition
--Asserts that character names, character name aliases, and the names of NUSI share the same namespace.
--Asserts that LMR-B applies to character names and the names of NUSI.
--Asserts nothing about the matching rules for character name aliases.

Both standards have enumerated lists of all character names (or relevant rules, in the case of CJK ideograph and Hangul syllable names) and all named character sequences. But the two standards *differ* in their treatment of character name aliases. The Unicode Standard has an explicitly normative list in NameAliases.txt in the UCD. 10646, on the other hand, only claims that aliases identified with the reference symbol in the names list are considered character name aliases bound by the uniqueness criteria of the namespace for names.

By an interesting collection of happenstance, both published standards barely avoid having an explicit error for the two (formal) character name aliases in question: CANCEL and CANCEL CHARACTER.
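
To make the difference between the two rules concrete, here is a minimal Python sketch of LMR-A and LMR-B as summarized above. It is only an illustration, not the code used in Unibook or worddist. The function names fold_lmr_a and fold_lmr_b are hypothetical, and the sketch assumes that "ignoring" the three strings under LMR-B means dropping them as whole words, which is only one possible reading of the specifications.

```python
import re

# Words folded away under LMR-B. Treating them as whole words is an
# assumption of this sketch; the specifications do not pin this down.
IGNORED_WORDS = {"CHARACTER", "LETTER", "DIGIT"}


def fold_lmr_a(name: str) -> str:
    """Fold away case, spaces, and medial hyphen-minus (LMR-A)."""
    upper = name.upper()
    # A medial hyphen-minus sits directly between two letters or digits.
    folded = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)
    folded = "".join(folded.split())
    # Grandfathered exception: keep the hyphen of U+1180 so that
    # HANGUL JUNGSEONG O-E stays distinct from HANGUL JUNGSEONG OE.
    if folded == "HANGULJUNGSEONGOE" and re.search(r"O-E\s*$", upper):
        return "HANGULJUNGSEONGO-E"
    return folded


def fold_lmr_b(name: str) -> str:
    """LMR-A plus dropping the words CHARACTER, LETTER, and DIGIT."""
    words = [w for w in name.upper().split() if w not in IGNORED_WORDS]
    return fold_lmr_a(" ".join(words))


if __name__ == "__main__":
    a, b = "CANCEL", "CANCEL CHARACTER"
    print("LMR-A match:", fold_lmr_a(a) == fold_lmr_a(b))  # False
    print("LMR-B match:", fold_lmr_b(a) == fold_lmr_b(b))  # True
```

Under this sketch the two aliases are distinct by LMR-A but collide by LMR-B, which is exactly the discrepancy at issue.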
In the Unicode Standard, Version 6.2, it is asserted that LMR-A applies to character name aliases. Because the two character name aliases CANCEL and CANCEL CHARACTER do not match under LMR-A, they are formally allowed.

ISO/IEC 10646, 3rd Edition asserts nothing about the matching rule which applies to character name aliases, although it does assert they share the same namespace with character names and names of NUSI, for which LMR-B applies. However, neither CANCEL nor CANCEL CHARACTER is listed in the names list of the charts as a character name alias, so there is no issue of whether they are formally allowed or not.

Implications for Implementations

There are two "in-house" utilities used by the editors of the Unicode Standard and 10646 to check for possible violations of name uniqueness criteria. One of those is built into the Unibook utility, and is used automatically during chart production to detect potential name collisions in new amendments or other charted names lists. It is asserted (by Unibook's author, Asmus Freytag) that Unibook uses LMR-B for detecting name matches.

The other in-house utility is a small program, worddist, that I wrote and use as part of the release cycle, to check before publication that the UCD does not contain any entries violating the name uniqueness criteria. I assert that that program also uses LMR-B for detecting name matches. It was, in fact, a recent check with that program which spotted the interpretation problem for the matching of the CANCEL and CANCEL CHARACTER aliases.

There are also an unknown (and unknowable) number of other implementations of loose name matching rules in existence. Of particular importance may be loose name matching rules for implementations of regular expression matching, because those implementations externalize behavior which may then be baked into an unknown number of applications making use of regular expression matching.
Of particular interest, because of its high degree of leverage, may be any details of loose name matching in ICU.

Fixing the Problem

There are several strategies which could be followed in an attempt to fix the problem. In an effort to focus the discussion and decision-making, I have outlined three potential approaches, along with their most apparent advantages and disadvantages.

1. The Null Strategy

We could simply choose to change nothing in the specifications.

Advantages: Hey, it's easy to do! And with careful parsing of the text, it is possible to make the case that the CANCEL versus CANCEL CHARACTER aliases don't actually violate any uniqueness constraints. We published those a year and a half ago, and nobody has complained yet, so...

Disadvantages: Somebody besides me is likely to complain about the inconsistencies in the standards (especially now that I've laid out all the details for people to see), so the UTC and WG2 may end up being forced to make consistency changes eventually. And delaying fixes like these in the standards is almost always more costly than taking care of them earlier.

2. The LMR-A Strategy

This strategy would emphasize the LMR-A rule, and attempt to make all the specifications consistent with that rule. The implied changes would consist of:

* In UAX #34, tweak the statement of UAX34-R3 (and surrounding text) to make it clear that the *uniqueness* rule for the namespace is LMR-A, but that for practical reasons, the UTC also will not approve names for named sequences which differ from existing character names (or each other) only by the choice from the set in ... X { CHARACTER, LETTER, DIGIT } X ...

* In 10646 Clause 24.5.4,

  add the phrase: ', character name aliases'

  delete the phrase: 'and even when the words "LETTER", "CHARACTER", and "DIGIT" are ignored'

  Instead, add a note pointing out that character names and names for NUSI also do not differ simply by a choice of "LETTER" versus "CHARACTER" versus "DIGIT" in the names.
Together, these changes would then formally align 10646 with what the Unicode Standard would claim about name matching and name uniqueness.

Advantages: This approach would minimize the amount of change in the Unicode specifications. It would keep the implementation of name matching for regex stable. It minimizes the behavioral changes. It would be easy to roll out in Unicode 6.3, because it only implies a local change in one UAX. It would synchronize the standards and clarify the intent of the namespace uniqueness.

Disadvantages: This approach requires a larger change to the text of 10646, which could somehow be spun as "advantaging" Unicode and make the discussion in WG2 more fraught, even though it has no practical implication for the standard itself. This approach also disconnects the formal uniqueness criteria of the namespace from additional criteria that we might like to apply to prohibit certain types of name distinctions. (In effect, tools like Unibook would be applying the uniqueness criteria *plus* some other list of foldings not necessarily a part of LMR-A.)

3. The LMR-B Strategy

This strategy would emphasize the LMR-B rule, and attempt to make all the specifications consistent with that rule. The implied changes would consist of (at least):

* In TUS Section 4.8, update the specification about name matching to include ignoring "LETTER", "CHARACTER", and "DIGIT".

* In UAX #44, Section 5.9.2, update UAX44-LM2 to include ignoring "LETTER", "CHARACTER", and "DIGIT". Add information to the migration section about the discontinuity of the rule between versions and how to cope with that discontinuity.

* In UTS #18, Section 2.5, update the text about name matching to make it clear that UAX44-LM2 has changed between versions, and what the implications are for name matching. (This should include examples that show names that would not match by the earlier rule, but would match under the later rule: e.g. "CANCEL" and "CANCEL CHARACTER").

* Update the Unicode Character Encoding Stability Policy on name uniqueness, to make it explicitly follow the updated UAX44-LM2.

* In 10646 Clause 24.5.4,

  add the phrase: ', character name aliases'

This change would then formally align 10646 with what the Unicode Standard would claim about name matching and name uniqueness.

Advantages: This approach formalizes a "stronger" uniqueness rule that we might like to apply to new names and aliases, anyway. This approach also minimizes the amount of text change needed to 10646 to bring the specifications into synch. This approach would also canonize the strategy already baked into Unibook for checking name uniqueness. It would synchronize the standards and clarify the intent of the namespace uniqueness.

Disadvantages: Much more text needs to be changed in the specifications. The "CANCEL" aliases become another grandfathered exception that has to be baked into the uniqueness checking algorithms. Regex implementations are potentially destabilized. There are numerous implications for how the discontinuity between versions would have to be documented, and I haven't worked out all the text details for that here. This approach would effectively not be possible for Unicode 6.3, because it hits the core specification and a number of other documents, including the non-synchronized UTS #18. Another potentially serious disadvantage is that the scope of ignoring "CHARACTER", "LETTER", and "DIGIT" isn't exactly clear, so the rule might need further elaboration and examples added to make it clear.*

Conclusion

In the interest of disclosure, I should point out that I am very strongly in favor of Strategy #2 and very, very strongly opposed to Strategy #3. I think going down the route of Strategy #3 would be a major mistake and seriously destabilizing. However, I admit that others see things differently, and some have already come out more or less strongly for the basic approach I outline under Strategy #3.
Because different people see the implications differently, I have made an attempt to lay out the main advantages and disadvantages of each approach as I see them, and would encourage others to re-evaluate and come up with their own assessment of potential advantages and disadvantages as part of the discussion.

=========================================================================

* Note on the scope issue for ignoring "CHARACTER", "LETTER", or "DIGIT"

From the examples given in the relevant text of the specifications, it is clear that the prototypical cases intended are like the following:

SARATI LETTER AA
SARATI CHARACTER AA
SARATI DIGIT AA
SARATI AA

Those four names would all be considered "the same" under LMR-B criteria.

However, both the statement of UAX34-R2 and the statement of the criteria in 10646 are ambiguous about just what "ignoring" means here. One possible interpretation, which one would derive from all the examples given, implies removal of the *whole word* "CHARACTER", "LETTER", or "DIGIT", bounded by spaces, before removal of spaces to do comparison. But in practice, it is almost certain that implementers will interpret ignoring as meaning "remove the substring". And there is a further ambiguity, in that removing the substring in question could occur either before or after removal of whitespace (unless the specification is quite clear about this), and the resulting sets of matches could in principle differ accordingly.

Consider some of the following examples (nonexhaustive), for which it isn't immediately obvious what the intent of this ignore rule might have been.

1. Aliases and character names where the target string isn't used in its prototypical way in the names (i.e., not like the SARATI examples above):

CANCEL CHARACTER
CHARACTER TABULATION
SINGLE CHARACTER INTRODUCER
CHARACTER TIE   <-- note, unexpectedly makes "TIE" an invalid name
REPLACEMENT CHARACTER
NATIONAL DIGIT SHAPES
DIGIT ONE FULL STOP
DIGIT ONE COMMA
LOVE LETTER   <-- note, unexpectedly makes "LOVE" an invalid name

2. Character names where the target string isn't a whole word:

MAHJONG TILE ONE OF CHARACTERS
INPUT SYMBOL FOR LATIN LETTERS

IMO, this kind of folding rule that folds out specific character strings (as opposed to unconditional removal of each instance of a space, for example) tends both to be more complicated to specify (and correspondingly fragile in implementation, because people may interpret the rule differently), *and* can lead to unexpected results and surprises. Who could expect, for example, a regex match for \p{name=TIE} to turn up a match for U+2040 CHARACTER TIE? Or what would happen if in addition to TAI LE LETTER THA, somebody decides that we need to encode TAI LE LETTER TTER? Does that surprisingly match "TAI" or not? Yeah, maybe stuff like this is goofy, and wouldn't happen, but then nobody thought "BELL" would be a problem, and nobody noticed "CANCEL" and "CANCEL CHARACTER" for a year and a half, either.

Appendix

=========================================================================================

Unicode 6.2

What the Unicode Standard, Version 6.2 (published 2012-09-26) actually *says* about name uniqueness:

****************************************************************************************

4.8 Name

Character Name Matching. When matching identifiers transposed from character names, it is possible to ignore case, whitespace, and all medial hyphen-minus characters (or any “_” replacing a hyphen-minus), except for the hyphen-minus in U+1180 hangul jungseong o-e, and still result in a unique match.
For example, “ZERO WIDTH SPACE” is equivalent to “zero-width-space” or “ZERO_WIDTH_SPACE” or “ZeroWidthSpace”. However, “TIBETAN LETTER A” should not match “TIBETAN LETTER -A”, because in that instance the hyphen-minus is not medial between two letters, but is instead preceded by a space. For more information on character name matching, see Section 5.7, “Matching Rules” in Unicode Standard Annex #44, “Unicode Character Database.” Named Character Sequences. Occasionally, character sequences are also given a normative name in the Unicode Standard. The names for such sequences are taken from the same namespace as character names, and are also unique. For details, see Unicode Standard Annex #34, “Unicode Named Character Sequences.” Named character sequences are not listed in the code charts; instead, they are listed in the file NamedSequences.txt in the Unicode Character Database. The names for named character sequences are also immutable. Once assigned, they will never be changed in subsequent versions of the Unicode Standard. Character Name Aliases. Sometimes errors in a character name are discovered after publication. Because character names are immutable, such errors are not corrected by changing the names. However, in some limited instances (as for obvious typos in a character name), the Unicode Standard publishes an additional, corrected name as a normative character name alias. (See Definition D5 in Section 3.3, Semantics.) Character name aliases are immutable once published and are also guaranteed to be unique in the namespace for character names. A character may, in principle, have more than one normative character name alias. Character name aliases which serve to correct errors in character names are listed in the code charts, using a special typographical convention explained in Section 17.1, Character Names List. They are also separately listed in the file NameAliases.txt in the Unicode Character Database. 
In addition to such corrections, the file NameAliases.txt contains aliases that give definitive labels to control codes, which have no actual Unicode character name. Additional aliases match existing and widely used alternative names and abbreviations for control codes and for Unicode format characters. Specifying these additional, normative character name aliases serves two major functions. First, it provides a set of well-defined aliases for use in regular expression matching and searching, where users might expect to be able to use established names or abbreviations for control codes and the like, but where those names or abbreviations are not part of the actual Unicode Name property. Second, because character name aliases are guaranteed to be unique in the Unicode namespace, having them defined for control codes and abbreviations prevents the potential for accidental collisions between de facto current use and names which might be chosen in the future for newly encoded Unicode characters. A normative character name alias is distinct from the informative aliases listed in the code charts. Informative aliases merely point out other common names in use for a given character. Informative aliases are not immutable and are not guaranteed to be unique; they therefore cannot serve as an identifier for a character. Their main purposes are to help readers of the standard to locate and to identify particular characters. **************************************************************************************** [UAX #44] 5.9.2 Matching Character Names Unicode character names constitute a special case. Formally, they are values of the Name property. While each Unicode character name for an assigned character is guaranteed to be unique, names are assigned in such a way that the presence or absence of spaces cannot be used to distinguish them. Furthermore, implementations sometimes create identifiers from Unicode character names by inserting underscores for spaces. 
For best results in comparing Unicode character names, use loose matching rule UAX44-LM2. UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E. •"zero-width space" is equivalent to "ZERO WIDTH SPACE" or "zerowidthspace" •"character -a" is not equivalent to "character a" In this rule "medial hyphen" is to be construed as a hyphen occurring immediately between two letters in the normative Unicode character name, as published in the Unicode names list, and not to any hyphen that may transiently occur medially as a result of removing whitespace before removing hyphens in a particular implementation of matching. Thus the hyphen in the name U+10089 LINEAR B IDEOGRAM B107M HE-GOAT is medial, and should be ignored in loose matching, but the hyphen in the name U+0F39 TIBETAN MARK TSA -PHRU is not medial, and should not be ignored in loose matching. An implementation of this loose matching rule can obtain the correct results when comparing two strings by doing the following three operations, in order: 1.remove all medial hyphens (except the medial hyphen in the name for U+1180) 2.remove all whitespace and underscore characters 3.apply toLowercase() to both strings After applying these three operations, if the two strings compare binary equal, then they are considered to match. This is a logical statement of how the rule works. If programmed carefully, an implementation of the matching rule can transform the strings in a single pass. It is also possible to compare two name strings for loose matching while transforming each string incrementally. Loose matching rule UAX44-LM2 is also appropriate for matching character name aliases and the names of named character sequences, which share the namespace (and matching behavior) of Unicode character names. 
See Section 4.8, Name in [Unicode] Implementations of name matching should use extreme care when matching non-standard, alternative names for particular characters. The Name Uniqueness Policy in the Unicode Consortium Stability Policies [Stability] guarantees that the Unicode Standard will never add a character whose name would match an existing encoded character, according to matching rule UAX44-LM2. However, any other name for a character might be used in the future. **************************************************************************************** [UAX #34] 4 Names Names of Unicode named character sequences are unique. They are part of the same namespace as Unicode character names. As a result, where a name exists as a character name, a modified name must be assigned instead. The same applies to not-yet-encoded characters. ... Names for named character sequences are constructed according to the following rules: UAX34-R1. Only Latin capital letters A to Z, digits 0 to 9 (provided that a digit is not the first character in a word), SPACE, and HYPHEN-MINUS are used for writing the names. UAX34-R2. Only one name is given to each named character sequence, and each named character sequence must have a unique name within the namespace that named character sequences share with character names. UAX34-R3. As for character names, names for sequences are unique if they are different even when SPACE and medial HYPHEN-MINUS characters are ignored, and when the strings “LETTER”, “CHARACTER”, and “DIGIT” are ignored in comparison of the names. The following two character names are exceptions to this rule, because they were created before this rule was specified: 116C HANGUL JUNGSEONG OE 1180 HANGUL JUNGSEONG O-E Examples of unacceptable names that are not unique: SARATI LETTER AA SARATI CHARACTER AA These two names would not be unique if the strings “LETTER” and “CHARACTER” were ignored. 
**************************************************************************************** What the Unicode Character Encoding Stability Policy has to say about name uniqueness: **************************************************************************************** Name Uniqueness Applicable Version: Unicode 2.0+ The names of characters, formal aliases, and named character sequences are unique within a shared namespace. The names of characters, named character sequences, and formal aliases for characters share a single namespace in which each name uniquely identifies either a single character or a single named character sequence. The definition of uniqueness is not just a simple comparison of the characters—instead, the loose matching rules from UAX #44, Unicode Character Database are used. Note: As of Unicode 4.1, named character sequences were added to this shared namespace; as of Unicode 5.0, formal aliases were also added. **************************************************************************************** What UTS #18, Unicode Regular Expressions, Version 15 (published 2012-07-17) has to say about name matching and name uniqueness: **************************************************************************************** [UTS #18] 2.5 Name Properties RL2.5 Name Properties To meet this requirement, an implementation shall support individually named characters. When using names in regular expressions, the data is supplied in both the Name (na) and Name_Alias properties in the UCD, as described in UAX #44: Unicode Character Database [UAX44], or computed as in the case of CJK Ideographs or Hangul Syllables. Name matching rules follow Matching Rules from [UAX44]. ... Implementers may add aliases beyond those recognized in the UCD. They must be aware that such additional aliases may cause problems if they collide with future character names or aliases. 
For example, implementations that used the name "BELL" for U+0007 broke when the new character U+1F514 ( ) BELL was introduced. 2.5.1 Individually Named Characters The following provides syntax for specifying a code point by supplying the precise name. This syntax specifies a single code point, which can thus be used in ranges. := "\N{" "}" The \N syntax is related to the syntax \p{name=...}, but there are three important distinctions: 1.\N matches a single character or a sequence, while \p matches a set of characters. 2.The \p{name=} may silently fail, if no character exists with that name. The \N syntax should instead cause a syntax error for an undefined name. 3.The \p{name=...} syntax can be used meaningfully with wildcards (see Section 2.6 Wildcards in Property Values). For example, in Unicode 6.1, \p{name=/ALIEN/} would designate a set of two characters: •U+1F47D ( ) EXTRATERRESTRIAL ALIEN, •U+1F47E ( ) ALIEN MONSTER 4.The namespace for the \p{name=...} syntax is the namespace for character names plus name aliases. The namespace for the \N syntax includes named sequences defined in NamedSequences.txt, such as \N{KHMER CONSONANT SIGN COENG KA}. Sequences behave as a single element, so \N{KHMER CONSONANT SIGN COENG KA}* should be treated as if it were the expression (\u{17D2 1780})*. As with other property values, names should use a loose match, disregarding case, spaces and hyphen (the underbar character "_" cannot occur in Unicode character names). An implementation may also choose to allow namespaces, where some prefix like "LATIN LETTER" is set globally and used if there is no match otherwise. There are, however, three instances that require special-casing with loose matching, where an extra test shall be made for the presence or absence of a hyphen. 
•U+0F68 TIBETAN LETTER A and U+0F60 TIBETAN LETTER -A •U+0FB8 TIBETAN SUBJOINED LETTER A and U+0FB0 TIBETAN SUBJOINED LETTER -A •U+116C HANGUL JUNGSEONG OE and U+1180 HANGUL JUNGSEONG O-E **************************************************************************************** ======================================================================================== 10646 3rd edition What 10646 3rd edition (published 2013-04-15) actually *says* about name uniqueness: **************************************************************************************** 24.5.3 Character names, character name aliases, and named UCS sequence identifiers Character names, character name aliases and named UCS sequence identifiers, taken together, constitute a name space. Each character name, character name aliases, or named UCS sequence identifier shall be unique and distinct from all other character names, character name aliases, or named UCS sequence identifiers. Clause 24.5.4 Determining Uniqueness For character names and named UCS sequence identifiers, two names shall be considered unique and distinct if they are different even when SPACE and medial HYPHEN-MINUS characters are ignored and even when the words "LETTER", "CHARACTER", and "DIGIT" are ignored in comparison of the names. The following two character names shall be considered unique and distinct: HANGUL JUNGSEONG OE HANGUL JUNGSEONG O-E NOTE 2 – These two character names are explicitly handled as an exception, because they were defined in an earlier version of this International Standard before the introduction of the name uniqueness requirement. This pair is, has been, and will be the only exception to the uniqueness rule in this International Standard. **************************************************************************************** In 10646 3rd edition, all character names and named UCS sequence identifiers are normatively listed. 
Only the character name aliases which are printed in the code charts have normative status. 10646 does not list *all* of the character name aliases that are listed in the UCD file, NameAliases.txt. In particular, the aliases "CANCEL" for U+0018 and "CANCEL CHARACTER" for U+0094 are printed as *informative* aliases in the names list for the 3rd edition. Hence, they do not fall under any normative prescriptions for name uniqueness in the 3rd edition. Note also that while Amendment 1 to the 3rd edition hit Clause 24 to account for the fact that the NUSI are now defined by reference to the data file instead of a table printed in the standard, the text *about* name uniqueness has not changed. =========================================================================================
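
Illustration of the scope issue

The ambiguity described in the note on the scope of ignoring "CHARACTER", "LETTER", or "DIGIT" can be made concrete with a small Python sketch. It is not taken from either standard or from any existing tool. The function names ignore_whole_words and ignore_substrings are hypothetical, and the "repeat" option (folding again until nothing changes) is one further reading an implementer might plausibly adopt.

```python
TARGETS = ("CHARACTER", "LETTER", "DIGIT")


def ignore_whole_words(name: str) -> str:
    """Drop the target strings only when they are whole words."""
    return "".join(w for w in name.upper().split() if w not in TARGETS)


def ignore_substrings(name: str, repeat: bool = False) -> str:
    """Drop every occurrence of the target strings, then drop spaces.

    With repeat=True the folding is reapplied until the string stops
    changing (an assumed, stricter reading of "ignore").
    """
    s = name.upper()
    while True:
        before = s
        for target in TARGETS:
            s = s.replace(target, "")
        s = s.replace(" ", "")
        if not repeat or s == before:
            return s


if __name__ == "__main__":
    for name in ("CHARACTER TIE",
                 "MAHJONG TILE ONE OF CHARACTERS",
                 "TAI LE LETTER TTER"):
        print(name)
        print("  whole words        :", ignore_whole_words(name))
        print("  substrings         :", ignore_substrings(name))
        print("  substrings, repeat :", ignore_substrings(name, repeat=True))
```

The three readings agree on CHARACTER TIE (all yield TIE), disagree on MAHJONG TILE ONE OF CHARACTERS, and only the repeated substring reading folds the hypothetical TAI LE LETTER TTER all the way down to TAI.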