Loose Name Matching

L2/02-365R1

Re:	Loose Name Matching
From:	Mark Davis
Date:	2002-10-30

For property names, we recommend loose string matching: only letters (and possibly digits) are taken into account when matching. In particular, spaces and hyphens are disregarded in loose matching. The property aliases are vetted to make sure that this does not cause collisions: that the aliases will always remain distinct even if only letters and digits are considered in matching.
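
As a rough illustration of the comparison described above, the following Python sketch reduces two property aliases to their letters and digits before comparing them. The function names are hypothetical, and ignoring case and restricting to ASCII are assumptions of this sketch rather than something stated above.

```python
import re

def loose_alias_key(alias: str) -> str:
    """Reduce a property alias to its ASCII letters and digits.

    Lowercasing is an assumption of this sketch; the text above only
    says that spaces and hyphens are disregarded.
    """
    return re.sub(r"[^a-z0-9]", "", alias.lower())

def loose_alias_equal(a: str, b: str) -> bool:
    """Return True if two aliases are the same under loose matching."""
    return loose_alias_key(a) == loose_alias_key(b)
```

Under these assumptions, "General_Category" and "general-category" compare equal, while "Lowercase" and "Uppercase" do not.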

It is useful to do this for Unicode character names as well, for environments like regular expressions. There are currently only three cases where loose matching fails, because two distinct character names reduce to the same string:

U+0F68 TIBETAN LETTER A and
U+0F60 TIBETAN LETTER -A
U+0FB8 TIBETAN SUBJOINED LETTER A and
U+0FB0 TIBETAN SUBJOINED LETTER -A
U+116C HANGUL JUNGSEONG OE and
U+1180 HANGUL JUNGSEONG O-E

With such a limited number of exceptions, one can still match loosely, by special-casing these exceptions. This can be done by setting up the lookup table to exclude U+0F60, U+0FB0, and U+1180, so that there is just one code point per transformed input string. Then when matching, the following process is used:

1. Try a loose match.
2. If there is no matching code point, return failure.
3. If the matched code point is not one of U+0F68, U+0FB8, or U+116C, return the matched code point.
4. If the second-to-last character of the input (ignoring trailing spaces) is not "-", return the matched code point.
5. Otherwise return U+0F60, U+0FB0, or U+1180, respectively.
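
The process above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the names `HYPHENATED_TWIN`, `loose_key`, `build_table`, and `loose_lookup` are hypothetical, the table is built from caller-supplied (code point, name) pairs, and input is assumed to be matched case-insensitively.

```python
import re

# Maps each code point kept in the table to the code point that was
# excluded because its name differs only by a hyphen.
HYPHENATED_TWIN = {0x0F68: 0x0F60, 0x0FB8: 0x0FB0, 0x116C: 0x1180}

def loose_key(name: str) -> str:
    """Keep only ASCII letters and digits, uppercased (an assumption)."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())

def build_table(named_code_points):
    """Build the lookup table from (code point, name) pairs,
    leaving out U+0F60, U+0FB0, and U+1180."""
    excluded = set(HYPHENATED_TWIN.values())
    table = {}
    for code_point, name in named_code_points:
        if code_point not in excluded:
            table[loose_key(name)] = code_point
    return table

def loose_lookup(table, text):
    """Return the matched code point, or None on failure."""
    code_point = table.get(loose_key(text))
    if code_point is None:
        return None  # no matching code point
    if code_point not in HYPHENATED_TWIN:
        return code_point  # not one of the special cases
    stripped = text.rstrip(" ")
    # Check the last character but one, ignoring trailing spaces.
    if len(stripped) >= 2 and stripped[-2] == "-":
        return HYPHENATED_TWIN[code_point]
    return code_point
```

With a table containing the six names listed earlier, `loose_lookup(table, "tibetan letter a")` would yield U+0F68 and `loose_lookup(table, "Tibetan Letter -A ")` would yield U+0F60.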

As it turns out, the match can even be slightly looser than with property aliases: one can also remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT", and still not have collisions.

To keep loose matching easy in the future, I recommend that the UTC and WG2 adopt the following policy:

Whenever a character name is assigned to a new character, that name will be distinct from all existing character names, even under the following transformation:
1. Remove all characters except for letters and decimal digits
  - Letters and decimal digits are those with general-category = L or Nd.
2. Remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT"
  - This is only applicable to the English normative character names, not to translated names.
3. Case-fold all characters.
  - This is only applicable to translated names that may contain both uppercase and lowercase characters.
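
To make the proposed transformation concrete, here is a hedged Python sketch of the three steps together with a simple collision check that could be run against a list of candidate names. The names `policy_key` and `find_collisions` and the `english` flag are hypothetical. Removing the three letter sequences in a single left-to-right pass, after the non-letters have already been stripped, is one reading of step 2 and is an assumption of this sketch.

```python
import re
import unicodedata
from collections import defaultdict

# Letter sequences removed in step 2 (English names only).
_FILLER = re.compile("LETTER|CHARACTER|DIGIT")

def policy_key(name: str, english: bool = True) -> str:
    """Apply the three-step transformation to a character name."""
    # Step 1: keep characters whose general category is L* or Nd.
    kept = "".join(
        ch for ch in name
        if unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) == "Nd"
    )
    # Step 2: drop the filler sequences, in one pass (assumption).
    if english:
        kept = _FILLER.sub("", kept)
    # Step 3: case-fold, which matters for mixed-case translated names.
    return kept.casefold()

def find_collisions(names, english: bool = True):
    """Group names that become identical under the transformation."""
    groups = defaultdict(list)
    for name in names:
        groups[policy_key(name, english)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

A check of this kind would flag "TIBETAN LETTER A" and "TIBETAN LETTER -A" as a colliding pair, since both reduce to the same key.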