Loose Name Matching

L2/02-365R

Re:	Loose Name Matching
From:	Mark Davis
Date:	2002-10-29

For property names, we recommend loose string matching: only letters (and possibly digits) are taken into account when matching. In particular, spaces and hyphens are disregarded. The property aliases are vetted to make sure that this does not cause collisions; that is, the aliases will always remain distinct even if only letters and digits are considered in matching.
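
As an illustration only, loose matching of property names could be sketched in Python as below. The helper names loose_key and loose_equal are hypothetical, and ignoring case is an added assumption that the paragraph above does not state.

```python
def loose_key(name: str) -> str:
    """Reduce a name to the characters that count in loose matching."""
    # Keep only letters (general category L*) and decimal digits (Nd);
    # spaces, hyphens, underscores, and other punctuation are dropped.
    kept = "".join(ch for ch in name if ch.isalpha() or ch.isdecimal())
    # Assumption: case is ignored as well.
    return kept.casefold()


def loose_equal(a: str, b: str) -> bool:
    """True if two names are the same under loose matching."""
    return loose_key(a) == loose_key(b)
```

With this sketch, "Line_Break", "line-break", and "LINE BREAK" all reduce to the same key, while "Script" and "Scripts" stay distinct.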

Loose matching is also useful for Unicode character names, for example in regular expressions. There are currently only three pairs of names that become identical under loose matching:

U+0F60 TIBETAN LETTER -A and
U+0F68 TIBETAN LETTER A
U+0FB0 TIBETAN SUBJOINED LETTER -A and
U+0FB8 TIBETAN SUBJOINED LETTER A
U+116C HANGUL JUNGSEONG OE and
U+1180 HANGUL JUNGSEONG O-E

With so few exceptions, one can still match loosely by special-casing them. This can be done in the following way:

try a loose match
if there is no match, return failure
if the matched code point is not one of U+0F68, U+0FB8, U+116C, return the matched code point
if the last character but one (excluding trailing spaces) of the input is not "-", return the matched code point
otherwise return U+0F60, U+0FB0, or U+1180 respectively
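
The steps above might be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the small NAMES table stands in for a full character-name table, the names loose_key, HYPHEN_TWIN, LOOSE_TABLE, and lookup are hypothetical, and ignoring case is an added assumption.

```python
# Tiny stand-in for a full character-name table (illustration only).
NAMES = {
    0x0041: "LATIN CAPITAL LETTER A",
    0x0F60: "TIBETAN LETTER -A",
    0x0F68: "TIBETAN LETTER A",
    0x0FB0: "TIBETAN SUBJOINED LETTER -A",
    0x0FB8: "TIBETAN SUBJOINED LETTER A",
    0x116C: "HANGUL JUNGSEONG OE",
    0x1180: "HANGUL JUNGSEONG O-E",
}

# Hyphen-free member of each colliding pair -> its hyphenated counterpart.
HYPHEN_TWIN = {0x0F68: 0x0F60, 0x0FB8: 0x0FB0, 0x116C: 0x1180}


def loose_key(name):
    """Keep only letters and decimal digits; ignore case (assumption)."""
    return "".join(ch for ch in name if ch.isalpha() or ch.isdecimal()).upper()


# Loose-match table. The hyphenated members are left out so that each
# colliding key maps to the hyphen-free code point.
LOOSE_TABLE = {
    loose_key(name): cp
    for cp, name in NAMES.items()
    if cp not in HYPHEN_TWIN.values()
}


def lookup(name):
    """Return the code point for a loosely matched name, or None."""
    cp = LOOSE_TABLE.get(loose_key(name))   # try a loose match
    if cp is None:
        return None                         # no match: failure
    if cp not in HYPHEN_TWIN:
        return cp                           # not one of the special cases
    stripped = name.rstrip()                # exclude trailing spaces
    if len(stripped) >= 2 and stripped[-2] == "-":
        return HYPHEN_TWIN[cp]              # last character but one is "-"
    return cp
```

Under these assumptions, lookup("tibetan letter -a") yields U+0F60, lookup("Tibetan Letter A") yields U+0F68, and lookup("hangul-jungseong oe") yields U+116C because its hyphen is not in the next-to-last position.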

As it turns out, the match can even be slightly looser than with property aliases: one can also remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT", and still not have collisions.

To keep loose matching easy in the future, I recommend that the UTC and WG2 adopt the following policy:

Whenever a character name is assigned to a new character, that name will be distinct from all existing character names, even under the following transformation:
1. Remove all characters except for letters and decimal digits
  - Letters and decimal digits are those with general-category = L or Nd.
2. Remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT"
  - This is only applicable to the English normative character names
3. Case-fold all characters.
  - This is only applicable to translated names that may contain both uppercase and lowercase characters.
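
A check for the proposed policy could look roughly like the Python sketch below. It is one literal reading of the three steps, applied in the order given; the names policy_key and find_collisions and the english flag are hypothetical, and the sketch assumes English names are supplied in uppercase, as they appear in the standard.

```python
from collections import defaultdict

# Letter sequences dropped from English names in step 2.
DROPPED_SEQUENCES = ("LETTER", "CHARACTER", "DIGIT")


def policy_key(name, english=True):
    """Apply the three-step transformation to a character name."""
    # Step 1: keep only letters (category L*) and decimal digits (Nd).
    key = "".join(ch for ch in name if ch.isalpha() or ch.isdecimal())
    # Step 2: English names only; assumes the name is uppercase.
    if english:
        for seq in DROPPED_SEQUENCES:
            key = key.replace(seq, "")
    # Step 3: case-fold (matters for translated, mixed-case names).
    return key.casefold()


def find_collisions(names, english=True):
    """Group names that become identical under the transformation."""
    groups = defaultdict(list)
    for name in names:
        groups[policy_key(name, english)].append(name)
    # Only groups with more than one name are collisions.
    return [group for group in groups.values() if len(group) > 1]
```

Run over a full name list plus a proposed new name, any group returned by find_collisions would flag a name that violates the policy. With the existing repertoire, the Tibetan and Hangul pairs listed earlier would be reported.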