L2/13-142

Title:  Inconsistencies in Name Uniqueness Criteria

Author: Ken Whistler

Date:   July 16, 2013

Status: For consideration by the UTC

Summary

There are several inconsistencies in the statements about the criteria for
name uniqueness, both in Unicode 6.2 and in ISO/IEC 10646 3rd Edition. I think
it would be best to try to address and resolve these inconsistencies in
the context of the preparation of Unicode 6.3, and in ballot comments on
the text of the DIS for ISO/IEC 10646 4th Edition, so that the statements
and criteria would both be internally consistent in the standards and
be synchronized between the two standards. This contribution provides
background information, analysis, and an outline of alternative
approaches the UTC might take to the problem.

Background

The problem turned up during a recent check on name uniqueness, which uncovered
the fact that two (formal) character name aliases are not unique under
one interpretation of the name uniqueness criteria. The two character name
aliases in question are:

CANCEL, character name alias for U+0018

CANCEL CHARACTER, character name alias for U+0094

These two aliases were added to NameAliases.txt in Unicode 6.1, so they have
been published in the Unicode Standard since January 31, 2012.

The problem is that under one interpretation of the name uniqueness criteria, it
should not be allowable to have character name aliases which differ only by
the presence or absence of the string "CHARACTER" in the names. If that
interpretation is correct, then the UTC made a mistake in Unicode 6.1, and
some kind of correction is in order, perhaps to grandfather in the mistake
in the statement of the rules. If that interpretation is *not* correct, then
the relevant text in the standard should probably be adjusted to ensure
that it is internally consistent and does not lead people to an incorrect
interpretation of the criteria.

The issue derives from the fact that character names, (formal) character name
aliases, and the names of named Unicode character sequences are all designated
as sharing the same namespace. However, there is not, and never has been, any
formal syntactic definition of that namespace. Instead, we have gotten by with
various statements of "loose matching rules" for determining whether two names
match (or not), and have assumed that those statements are internally consistent
and also result in a well-defined namespace. 

Unfortunately, we have longstanding inconsistent statements about matching
rules, and those statements have in turn been extended inconsistently
as the namespace for character names grew, first to
include the names for named Unicode character sequences and then the (formal)
character name aliases. Finally, the attempts to keep 10646 in synch with
the Unicode Standard in this regard have themselves been inconsistent.

The result is something of a textual and conformance mess, with different
experts asserting different claims about what names match and what names do
not, as well as a whiff of incipient finger-pointing regarding the responsibility for
the "mistake" noted above for CANCEL and CANCEL CHARACTER.

Analysis

The full textual details of what is currently published in the two standards regarding
this question are provided below in the Appendix section. This should make it
much easier to compare text regarding the current state of affairs, without
people having to fumble around online comparing four different source pages for
the Unicode Standard (and policies) and the more difficult-to-locate current
text of 10646. I also have included a related excerpt from UTS #18, Unicode Regular
Expressions, which has a bearing on this issue. [The loose matching rules in
UTS #35 are not relevant to the name matching issue discussed here.]

Boiling it down, there are essentially two versions of the "loose" matching rules
in the specifications:

LMR-A: Ignore (i.e., fold away) any casing distinctions, spaces, and any medial 
hyphen-minus characters in names. Compare the resulting strings. If the folded
strings are binary equal, then the names match. There is one grandfathered
exception: U+1180 HANGUL JUNGSEONG O-E does not match U+116C HANGUL JUNGSEONG OE.

LMR-B: Identical to LMR-A, except that one also ignores (i.e., folds away) any
substring "CHARACTER", "LETTER", or "DIGIT".
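
For illustration, the two rules can be sketched in Python roughly as
follows. This is only a sketch under stated assumptions, not text from
either standard: the function names fold_lmr_a and fold_lmr_b are made up
here, "medial" is taken to mean a hyphen-minus directly between two letters
or digits, and LMR-B is read as plain substring removal applied once after
the LMR-A folding, which is only one of several possible readings (see the
note on the scope issue near the end of this document).

```python
import re

# Assumption: a "medial" hyphen-minus sits directly between two letters/digits
# in the name as written, before any spaces are removed.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")
_WHITESPACE = re.compile(r"\s+")
_IGNORED = ("CHARACTER", "LETTER", "DIGIT")

# The one grandfathered exception: keep the hyphen in HANGUL JUNGSEONG O-E.
_O_E_FOLDED = "HANGULJUNGSEONGO-E"


def fold_lmr_a(name: str) -> str:
    """Fold away case, spaces, and medial hyphen-minus (hypothetical helper)."""
    upper = name.upper()
    if _WHITESPACE.sub("", upper) == _O_E_FOLDED:
        return _O_E_FOLDED
    without_hyphens = _MEDIAL_HYPHEN.sub("", upper)
    return _WHITESPACE.sub("", without_hyphens)


def fold_lmr_b(name: str) -> str:
    """LMR-A folding, then drop the three substrings (one possible reading)."""
    folded = fold_lmr_a(name)
    for word in _IGNORED:
        folded = folded.replace(word, "")
    return folded


def matches(a: str, b: str, fold) -> bool:
    """Two names match if their folded forms are binary equal."""
    return fold(a) == fold(b)
```

Under these assumptions, matches("CANCEL", "CANCEL CHARACTER", fold_lmr_a)
is false while matches("CANCEL", "CANCEL CHARACTER", fold_lmr_b) is true,
which is exactly the discrepancy at issue.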

As the text stands currently, we have the following situation.

The Unicode Standard, Version 6.2

--Asserts that character names, character name aliases, and the names of Unicode
named character sequences share the same namespace.

--Asserts that LMR-A applies to character names and character name aliases.

--Asserts (inconsistently) that either LMR-A or LMR-B applies to the names of
Unicode named character sequences.

ISO/IEC 10646 3rd Edition

--Asserts that character names, character name aliases, and the names of NUSI
share the same namespace.

--Asserts that LMR-B applies to character names and the names of NUSI.

--Asserts nothing about the matching rules for character name aliases.

Both standards have enumerated lists of all character names (or relevant rules,
in the case of CJK ideograph and Hangul syllable names) and all named character
sequences. But the two standards *differ* in their treatment of character name aliases.
The Unicode Standard has an explicitly normative list in NameAliases.txt in the UCD.
10646, on the other hand, only claims that aliases identified with
the reference symbol in the names list are considered character name aliases
bound by the uniqueness criteria of the namespace for names.

By an interesting collection of happenstance, both published standards barely
avoid having an explicit error for the two (formal) character name aliases in question:
CANCEL and CANCEL CHARACTER.

In the Unicode Standard, Version 6.2, it is asserted that LMR-A applies to character
name aliases. Because the two character name aliases CANCEL and CANCEL CHARACTER
do not match under LMR-A, they are formally allowed.

ISO/IEC 10646, 3rd Edition asserts nothing about the matching rule which applies
to character name aliases, although it does assert they share the same namespace
with character names and names of NUSI, for which LMR-B applies. However, neither
CANCEL nor CANCEL CHARACTER is listed in the names list of the charts as a
character name alias, so there is no issue of whether they are formally allowed
or not.

Implications for Implementations

There are two "in-house" utilities used by the editors of the Unicode Standard
and 10646 to check for possible violations of name uniqueness criteria. One of
those is built into the Unibook utility, and is used automatically during chart
production to detect potential name collisions in new amendments or other
charted names lists. It is asserted (by Unibook's author, Asmus Freytag) that
Unibook uses LMR-B for detecting name matches.

The other in-house utility is a small program, worddist, that I wrote and use 
as part of the release cycle, to check before publication that the UCD does
not contain any entries violating the name uniqueness criteria. I assert
that this program also uses LMR-B for detecting name matches. It was, in fact,
a recent check with that program which spotted the interpretation problem for the
matching of the CANCEL and CANCEL CHARACTER aliases.
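
The general shape of such a check can be sketched as follows. This is not
the code of Unibook or worddist; it is a hypothetical Python illustration
in which find_collisions, fold_a, and fold_b are invented names, and fold_b
uses simple substring removal as a stand-in for LMR-B.

```python
import re
from collections import defaultdict

# Remove medial hyphen-minus (assumed: between letters/digits) and whitespace
# in one pass over the original string, so no hyphen becomes "medial" merely
# because a neighboring space was removed first.
_FOLD_A = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])|\s+")


def fold_a(name):
    """Stand-in for LMR-A, without the built-in O-E exception."""
    return _FOLD_A.sub("", name.upper())


def fold_b(name):
    """Stand-in for LMR-B: fold_a, then drop the three substrings once."""
    folded = fold_a(name)
    for word in ("CHARACTER", "LETTER", "DIGIT"):
        folded = folded.replace(word, "")
    return folded


def find_collisions(names, fold, grandfathered=()):
    """Group names by folded form; report groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    allowed = {frozenset(pair) for pair in grandfathered}
    return [
        group for group in groups.values()
        if len(group) > 1 and frozenset(group) not in allowed
    ]


NAMES = [
    "CANCEL", "CANCEL CHARACTER", "BELL",
    "HANGUL JUNGSEONG OE", "HANGUL JUNGSEONG O-E",
]
EXCEPTIONS = [("HANGUL JUNGSEONG OE", "HANGUL JUNGSEONG O-E")]

print(find_collisions(NAMES, fold_a, EXCEPTIONS))
print(find_collisions(NAMES, fold_b, EXCEPTIONS))
```

With this sample list, the fold_a run reports nothing, while the fold_b run
reports the CANCEL / CANCEL CHARACTER pair. The grandfathered Hangul pair
is handled here by an explicit exception list rather than inside the
folding function; either design would do.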

There are also an unknown (and unknowable) number of other implementations of
loose name matching rules in existence. Of particular importance may be
loose name matching rules for implementations of regular expression matching,
because those implementations externalize behavior which may then be baked into
an unknown number of applications making use of regular expression matching.

Of particular interest, because of its high degree of leverage, may be any
details of loose name matching in ICU.

Fixing the Problem

There are several strategies which could be followed in an attempt to fix
the problem. In an effort to focus the discussion and decision-making, I
have outlined three potential approaches, along with their most apparent
advantages and disadvantages.

1. The Null Strategy

We could simply choose to change nothing in the specifications.

Advantages: Hey, it's easy to do! And with careful parsing of the text,
it is possible to make the case that the CANCEL versus CANCEL CHARACTER
aliases don't actually violate any uniqueness constraints. We published
those a year and a half ago, and nobody has complained yet, so...

Disadvantages: Somebody besides me is likely to complain about the
inconsistencies in the standards (especially now that I've laid out
all the details for people to see), so the UTC and WG2 may end up being
forced to make consistency changes eventually. And delaying fixes like
these in the standards is almost always more costly than taking care of
them earlier.

2. The LMR-A Strategy

This strategy would emphasize the LMR-A rule, and attempt to make all
the specifications consistent with that rule. The implied changes would
consist of:

* In UAX #34, tweak the statement of UAX34-R3 (and surrounding text) to
make it clear that the *uniqueness* rule for the namespace is LMR-A, but
that for practical reasons, the UTC also will not approve names for
named sequences which differ from existing character names (or each other)
only by the choice from the set in ... X { CHARACTER, LETTER, DIGIT } X ...

* In 10646 Clause 24.5.4, add the phrase:

  ', character name aliases'

and delete the phrase:

  'and even when the words "LETTER", "CHARACTER", and "DIGIT" are ignored'

Instead, add a note pointing out that character names and names for NUSI also do not
differ simply by a choice of "LETTER" versus "CHARACTER" versus "DIGIT" in the
names.

Together, these changes would then formally align 10646 with what
the Unicode Standard would claim about name matching and name uniqueness.

Advantages: This approach would minimize the amount of change in the Unicode
specifications. It would keep the implementation of name matching for
regex stable. It minimizes the behavioral changes. It would be easy
to roll out in Unicode 6.3, because it only implies a local change
in one UAX. It would synchronize 
the standards and clarify the intent of the namespace uniqueness.

Disadvantages: This approach requires a larger change to the text of
10646, which could somehow be spun as "advantaging" Unicode
and make the discussion in WG2 more fraught, even though it has no
practical implication for the standard itself. This approach also
disconnects the formal uniqueness criteria of the namespace from
additional criteria that we might like to apply to prohibit certain
types of name distinctions. (In effect, tools like Unibook would be
applying the uniqueness criteria *plus* some other list of foldings
not necessarily a part of LMR-A.)

3. The LMR-B Strategy

This strategy would emphasize the LMR-B rule, and attempt to make all
the specifications consistent with that rule. The implied changes would
consist of (at least):

* In TUS Section 4.8, update the specification about name matching to
include ignoring "LETTER", "CHARACTER", and "DIGIT".

* In UAX #44, Section 5.9.2, update UAX44-LM2 to include ignoring "LETTER",
"CHARACTER", and "DIGIT". Add information to the migration section about
the discontinuity of the rule between versions and how to cope with that
discontinuity.

* In UTS #18, Section 2.5, update the text about name matching to make
it clear that UAX44-LM2 has changed between versions, and what the
implications are for name matching. (This should include examples that
show names that would not match by the earlier rule, but would match
under the later rule: e.g. "CANCEL" and "CANCEL CHARACTER").

* Update the Unicode Character Encoding Stability Policy on name
uniqueness, to make it explicitly follow the updated UAX44-LM2.

* In 10646 Clause 24.5.4, add the phrase:

  ', character name aliases'

This change would then formally align 10646 with what
the Unicode Standard would claim about name matching and name uniqueness.

Advantages: This approach formalizes a "stronger" uniqueness rule that
we might like to apply to new names and aliases, anyway. This approach
also minimizes the amount of text change needed to 10646 to bring the
specifications into synch. This approach would also canonize the
strategy already baked into Unibook for checking name uniqueness.
It would synchronize the standards and clarify the intent of the 
namespace uniqueness.

Disadvantages: Much more text needs to be changed in the specifications.
The "CANCEL" aliases become another grandfathered exception that has
to be baked into the uniqueness checking algorithms. Regex implementations
are potentially destabilized. There are numerous implications for how
the discontinuity between versions would have to be documented, and I
haven't worked out all the text details for that here. This approach
would effectively not be possible for Unicode 6.3, because it hits
the core specification and a number of other documents, including
the non-synchronized UTS #18. Another potentially
serious disadvantage is that the scope of ignoring "CHARACTER", "LETTER",
and "DIGIT" isn't exactly clear, so the rule might need further
elaboration and examples added to make it clear.*

Conclusion

In the interest of disclosure, I should point out that I am very
strongly in favor of Strategy #2 and very, very strongly opposed to
Strategy #3. I think going down the route of Strategy #3 would be a
major mistake and seriously destabilizing.

However, I admit that others see things differently, and some have
already come out more or less strongly for the basic approach I
outline under Strategy #3. Because different people see the
implications differently, I have made an attempt to lay out the
main advantages and disadvantages of each approach as I see them,
and would encourage others to re-evaluate and come up with their own
assessment of potential advantages and disadvantages as part of
the discussion.

=========================================================================

* Note on the scope issue for ignoring "CHARACTER", "LETTER", or "DIGIT"

From the examples given in the relevant text of the specifications,
it is clear that the prototypical cases intended are like the
following:

SARATI LETTER AA
SARATI CHARACTER AA
SARATI DIGIT AA
SARATI AA

Those four names would all be considered "the same" under LMR-B
criteria. However, both the statement of UAX34-R3 and the statement
of the criteria in 10646 are ambiguous about just what "ignoring"
means here. One possible interpretation, which one would derive
from all the examples given, implies removal of the *whole word*
"CHARACTER", "LETTER", or "DIGIT", bounded by spaces, before removal
of spaces to do comparison.

But in practice, it is almost certain that implementers will interpret
ignoring as meaning "remove the substring". And there is a further
ambiguity, in that removing the substring in question could occur
either before or after removal of whitespace (unless the specification
is quite clear about this), and the resulting sets of matches could
in principle differ accordingly.
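
The difference between these readings is easy to see in code. The
following Python sketch is hypothetical throughout: the three function
names are invented here, none of them is taken from either standard or
from any existing tool, and the name HYPOTHETICAL DIG IT is a made-up
string chosen only to show the effect of ordering.

```python
import re

IGNORED = ("CHARACTER", "LETTER", "DIGIT")


def _strip_spaces(text):
    return re.sub(r"\s+", "", text)


def _drop_substrings(text):
    for word in IGNORED:
        text = text.replace(word, "")
    return text


def whole_word_then_spaces(name):
    """Reading 1: drop whole space-bounded words, then remove spaces."""
    kept = [word for word in name.upper().split() if word not in IGNORED]
    return "".join(kept)


def substring_then_spaces(name):
    """Reading 2: drop the substrings first, then remove spaces."""
    return _strip_spaces(_drop_substrings(name.upper()))


def spaces_then_substring(name):
    """Reading 3: remove spaces first, then drop the substrings."""
    return _drop_substrings(_strip_spaces(name.upper()))


READINGS = (whole_word_then_spaces, substring_then_spaces, spaces_then_substring)

for sample in (
    "CHARACTER TIE",
    "MAHJONG TILE ONE OF CHARACTERS",
    "HYPOTHETICAL DIG IT",  # made-up name, not in the standard
):
    print(sample, [reading(sample) for reading in READINGS])
```

In this sketch all three readings fold CHARACTER TIE to the same string,
only the whole-word reading leaves CHARACTERS intact in the Mahjong tile
name, and only the spaces-first reading removes the "DIGIT" that forms
across the space in the made-up third name.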

Consider some of the following examples (nonexhaustive), for which it isn't immediately
obvious what the intent of this ignore rule might have been.

1. Aliases and character names where the target string isn't used
in its prototypical way in the names (i.e., not like the SARATI
examples above):

CANCEL CHARACTER
CHARACTER TABULATION
SINGLE CHARACTER INTRODUCER
CHARACTER TIE <-- note, unexpectedly makes "TIE" an invalid name
REPLACEMENT CHARACTER
NATIONAL DIGIT SHAPES
DIGIT ONE FULL STOP
DIGIT ONE COMMA
LOVE LETTER <-- note, unexpectedly makes "LOVE" an invalid name

2. Character names where the target string isn't a whole word:

MAHJONG TILE ONE OF CHARACTERS
INPUT SYMBOL FOR LATIN LETTERS

IMO, this kind of folding rule that folds out specific character
strings (as opposed to unconditional removal of each instance
of a space, for example) tends both to be more complicated to
specify (and correspondingly fragile in implementation, because
people may interpret the rule differently), *and* can lead to
unexpected results and surprises. Who could expect, for example,
a regex match for \p{name=TIE} to turn up a match for U+2040
CHARACTER TIE?

Or what would happen if in addition to TAI LE LETTER THA, somebody
decides that we need to encode TAI LE LETTER TTER? Does that
surprisingly match "TAI" or not?
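
The answer depends on yet another detail the specifications leave
unstated: whether the substring removal is applied once or repeated until
nothing more can be removed. A small Python sketch, assuming spaces are
removed first and using invented helper names (fold_spaces, remove_once,
remove_until_stable):

```python
import re

IGNORED = ("CHARACTER", "LETTER", "DIGIT")


def fold_spaces(name):
    """Uppercase and remove all whitespace."""
    return re.sub(r"\s+", "", name.upper())


def remove_once(folded):
    """Drop each ignored substring in a single left-to-right pass."""
    for word in IGNORED:
        folded = folded.replace(word, "")
    return folded


def remove_until_stable(folded):
    """Repeat the removal until the string no longer changes."""
    while True:
        reduced = remove_once(folded)
        if reduced == folded:
            return folded
        folded = reduced


name = "TAI LE LETTER TTER"  # the hypothetical name from the text above
print(remove_once(fold_spaces(name)))          # TAILETTER
print(remove_until_stable(fold_spaces(name)))  # TAI
```

A single pass leaves a residue that itself contains "LETTER"; only the
repeated removal reduces the name all the way to "TAI". Two careful
implementers could reasonably pick either behavior.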

Yeah, maybe stuff like this is goofy, and wouldn't happen, but
then nobody thought "BELL" would be a problem, and nobody
noticed "CANCEL" and "CANCEL CHARACTER" for a year and a half,
either.


Appendix

=========================================================================================

Unicode 6.2

What the Unicode Standard, Version 6.2 (published 2012-09-26) actually *says* about 
name uniqueness:

****************************************************************************************

4.8 Name

Character Name Matching. When matching identifiers transposed from character names,
it is possible to ignore case, whitespace, and all medial hyphen-minus characters (or any “_”
replacing a hyphen-minus), except for the hyphen-minus in U+1180 hangul jungseong o-e,
and still result in a unique match. For example, “ZERO WIDTH SPACE” is equivalent to
“zero-width-space” or “ZERO_WIDTH_SPACE” or “ZeroWidthSpace”. However,
“TIBETAN LETTER A” should not match “TIBETAN LETTER -A”, because in that instance
the hyphen-minus is not medial between two letters, but is instead preceded by a space. For
more information on character name matching, see Section 5.7, “Matching Rules” in Unicode
Standard Annex #44, “Unicode Character Database.”

Named Character Sequences. Occasionally, character sequences are also given a normative
name in the Unicode Standard. The names for such sequences are taken from the same
namespace as character names, and are also unique. For details, see Unicode Standard
Annex #34, “Unicode Named Character Sequences.” Named character sequences are not
listed in the code charts; instead, they are listed in the file NamedSequences.txt in the Unicode
Character Database.

The names for named character sequences are also immutable. Once assigned, they will
never be changed in subsequent versions of the Unicode Standard.

Character Name Aliases. Sometimes errors in a character name are discovered after publication.
Because character names are immutable, such errors are not corrected by changing
the names. However, in some limited instances (as for obvious typos in a character name),
the Unicode Standard publishes an additional, corrected name as a normative character
name alias. (See Definition D5 in Section 3.3, Semantics.) Character name aliases are
immutable once published and are also guaranteed to be unique in the namespace for character
names. A character may, in principle, have more than one normative character name
alias.

Character name aliases which serve to correct errors in character names are listed in the
code charts, using a special typographical convention explained in Section 17.1, Character
Names List. They are also separately listed in the file NameAliases.txt in the Unicode Character
Database.

In addition to such corrections, the file NameAliases.txt contains aliases that give definitive
labels to control codes, which have no actual Unicode character name. Additional aliases
match existing and widely used alternative names and abbreviations for control codes and
for Unicode format characters. Specifying these additional, normative character name
aliases serves two major functions. First, it provides a set of well-defined aliases for use in
regular expression matching and searching, where users might expect to be able to use
established names or abbreviations for control codes and the like, but where those names
or abbreviations are not part of the actual Unicode Name property. Second, because character
name aliases are guaranteed to be unique in the Unicode namespace, having them
defined for control codes and abbreviations prevents the potential for accidental collisions
between de facto current use and names which might be chosen in the future for newly
encoded Unicode characters.

A normative character name alias is distinct from the informative aliases listed in the code
charts. Informative aliases merely point out other common names in use for a given character.
Informative aliases are not immutable and are not guaranteed to be unique; they
therefore cannot serve as an identifier for a character. Their main purposes are to help
readers of the standard to locate and to identify particular characters.

****************************************************************************************

[UAX #44] 5.9.2 Matching Character Names

Unicode character names constitute a special case. Formally, they are values of the Name property. 
While each Unicode character name for an assigned character is guaranteed to be unique, names 
are assigned in such a way that the presence or absence of spaces cannot be used to distinguish 
them. Furthermore, implementations sometimes create identifiers from Unicode character names 
by inserting underscores for spaces. For best results in comparing Unicode character names, 
use loose matching rule UAX44-LM2.

UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen 
in U+1180 HANGUL JUNGSEONG O-E.

• "zero-width space" is equivalent to "ZERO WIDTH SPACE" or "zerowidthspace"
• "character -a" is not equivalent to "character a"

In this rule "medial hyphen" is to be construed as a hyphen occurring immediately between 
two letters in the normative Unicode character name, as published in the Unicode names list, 
and not to any hyphen that may transiently occur medially as a result of removing whitespace 
before removing hyphens in a particular implementation of matching. Thus the hyphen in the 
name U+10089 LINEAR B IDEOGRAM B107M HE-GOAT is medial, and should be ignored in loose matching, 
but the hyphen in the name U+0F39 TIBETAN MARK TSA -PHRU is not medial, and should not be 
ignored in loose matching.

An implementation of this loose matching rule can obtain the correct results when comparing 
two strings by doing the following three operations, in order:

1. remove all medial hyphens (except the medial hyphen in the name for U+1180)
2. remove all whitespace and underscore characters
3. apply toLowercase() to both strings

After applying these three operations, if the two strings compare binary equal, then they 
are considered to match.

This is a logical statement of how the rule works. If programmed carefully, an implementation 
of the matching rule can transform the strings in a single pass. It is also possible to compare 
two name strings for loose matching while transforming each string incrementally.

Loose matching rule UAX44-LM2 is also appropriate for matching character name aliases and the 
names of named character sequences, which share the namespace (and matching behavior) of 
Unicode character names. See Section 4.8, Name in [Unicode]

Implementations of name matching should use extreme care when matching non-standard, alternative 
names for particular characters. The Name Uniqueness Policy in the Unicode Consortium Stability 
Policies [Stability] guarantees that the Unicode Standard will never add a character whose name 
would match an existing encoded character, according to matching rule UAX44-LM2. However, any 
other name for a character might be used in the future.

****************************************************************************************

[UAX #34] 4 Names

Names of Unicode named character sequences are unique. They are part of the same namespace as
Unicode character names. As a result, where a name exists as a character name, a modified 
name must be assigned instead. The same applies to not-yet-encoded characters.

...

Names for named character sequences are constructed according to the following rules:

UAX34-R1. Only Latin capital letters A to Z, digits 0 to 9 (provided that a digit is 
not the first character in a word), SPACE, and HYPHEN-MINUS are used for writing the names.

UAX34-R2. Only one name is given to each named character sequence, and each named character 
sequence must have a unique name within the namespace that named character sequences share 
with character names.

UAX34-R3. As for character names, names for sequences are unique if they are different even 
when SPACE and medial HYPHEN-MINUS characters are ignored, and when the strings “LETTER”, 
“CHARACTER”, and “DIGIT” are ignored in comparison of the names.

The following two character names are exceptions to this rule, because they were created 
before this rule was specified:

116C HANGUL JUNGSEONG OE
1180 HANGUL JUNGSEONG O-E

Examples of unacceptable names that are not unique:

SARATI LETTER AA
SARATI CHARACTER AA

These two names would not be unique if the strings “LETTER” and “CHARACTER” were ignored.

****************************************************************************************

What the Unicode Character Encoding Stability Policy has to say about name uniqueness:

****************************************************************************************

Name Uniqueness

Applicable Version: Unicode 2.0+

The names of characters, formal aliases, and named character sequences are unique within 
a shared namespace.

The names of characters, named character sequences, and formal aliases for characters 
share a single namespace in which each name uniquely identifies either a single character 
or a single named character sequence. The definition of uniqueness is not just a simple 
comparison of the characters—instead, the loose matching rules from UAX #44, 
Unicode Character Database are used.

Note: As of Unicode 4.1, named character sequences were added to this shared namespace;
as of Unicode 5.0, formal aliases were also added.

****************************************************************************************

What UTS #18, Unicode Regular Expressions, Version 15 (published 2012-07-17) has to 
say about name matching and name uniqueness:

****************************************************************************************

[UTS #18] 2.5 Name Properties

RL2.5 Name Properties 
 To meet this requirement, an implementation shall support individually named characters. 

When using names in regular expressions, the data is supplied in both the Name (na) and
Name_Alias properties in the UCD, as described in UAX #44: Unicode Character Database 
[UAX44], or computed as in the case of CJK Ideographs or Hangul Syllables. 
Name matching rules follow Matching Rules from [UAX44].

...

Implementers may add aliases beyond those recognized in the UCD. They must be aware that 
such additional aliases may cause problems if they collide with future character names or 
aliases. For example, implementations that used the name "BELL" for U+0007 broke when the 
new character U+1F514 ( ) BELL was introduced.

2.5.1 Individually Named Characters

The following provides syntax for specifying a code point by supplying the precise 
name. This syntax specifies a single code point, which can thus be used in ranges.

<codepoint> := "\N{" <character_name> "}" 

The \N syntax is related to the syntax \p{name=...}, but there are three important distinctions:

1. \N matches a single character or a sequence, while \p matches a set of characters.

2. The \p{name=<character_name>} may silently fail, if no character exists with that name.
The \N syntax should instead cause a syntax error for an undefined name.

3. The \p{name=...} syntax can be used meaningfully with wildcards (see Section 2.6 Wildcards
in Property Values). For example, in Unicode 6.1, \p{name=/ALIEN/} would designate a
set of two characters:

• U+1F47D ( ) EXTRATERRESTRIAL ALIEN,
• U+1F47E ( ) ALIEN MONSTER

4. The namespace for the \p{name=...} syntax is the namespace for character names plus
name aliases. The namespace for the \N syntax includes named sequences defined in 
NamedSequences.txt, such as \N{KHMER CONSONANT SIGN COENG KA}. Sequences behave as a 
single element, so \N{KHMER CONSONANT SIGN COENG KA}* should be treated as if it were 
the expression (\u{17D2 1780})*. 

As with other property values, names should use a loose match, disregarding case, 
spaces and hyphen (the underbar character "_" cannot occur in Unicode character names).
An implementation may also choose to allow namespaces, where some prefix like 
"LATIN LETTER" is set globally and used if there is no match otherwise.

There are, however, three instances that require special-casing with loose matching, 
where an extra test shall be made for the presence or absence of a hyphen.

• U+0F68 TIBETAN LETTER A and
  U+0F60 TIBETAN LETTER -A
• U+0FB8 TIBETAN SUBJOINED LETTER A and
  U+0FB0 TIBETAN SUBJOINED LETTER -A
• U+116C HANGUL JUNGSEONG OE and
  U+1180 HANGUL JUNGSEONG O-E

****************************************************************************************

========================================================================================

10646 3rd edition

What 10646 3rd edition (published 2013-04-15) actually *says* about name uniqueness:

****************************************************************************************

24.5.3 Character names, character name aliases, and named UCS sequence identifiers

Character names, character name aliases and named UCS sequence identifiers, taken 
together, constitute a name space. Each character name, character name aliases, 
or named UCS sequence identifier shall be unique and distinct from all other 
character names, character name aliases, or named UCS sequence identifiers.

Clause 24.5.4 Determining Uniqueness

For character names and named UCS sequence identifiers, two names shall be considered 
unique and distinct if they are different even when SPACE and medial HYPHEN-MINUS 
characters are ignored and even when the words "LETTER", "CHARACTER", and "DIGIT" 
are ignored in comparison of the names.

The following two character names shall be considered unique and distinct:
HANGUL JUNGSEONG OE
HANGUL JUNGSEONG O-E

NOTE 2 – These two character names are explicitly handled as an exception, because they
were defined in an earlier version of this International Standard before the introduction 
of the name uniqueness requirement. This pair is, has been, and will be the only 
exception to the uniqueness rule in this International Standard.

****************************************************************************************

In 10646 3rd edition, all character names and named UCS sequence identifiers are
normatively listed. Only the character name aliases which are printed in the code
charts have normative status. 10646 does not list *all* of the character name aliases
that are listed in the UCD file, NameAliases.txt. In particular, the aliases "CANCEL"
for U+0018 and "CANCEL CHARACTER" for U+0094 are printed as *informative* aliases in
the names list for the 3rd edition. Hence, they do not fall under any normative
prescriptions for name uniqueness in the 3rd edition.

Note also that while Amendment 1 to the 3rd edition hit Clause 24 to account
for the fact that the NUSI are now defined by reference to the data file instead
of a table printed in the standard, the text *about* name uniqueness has not changed.

=========================================================================================