L2/02-365

Re: Loose Name Matching
From: Mark Davis
Date: 2002-10-29

For property names, we recommend loose string matching: only letters (and possibly digits) are taken into account when matching. In particular, spaces and hyphens are disregarded. The property aliases are vetted to make sure that this does not cause collisions: the aliases will always remain distinct even when only letters and digits are compared.
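As a minimal sketch of the loose matching described above, the following Python reduces a name to a key of letters and digits and compares keys. The function names loose_key, loose_equal and find_collisions are hypothetical, and folding case is an assumption of this sketch rather than something stated here.

```python
from collections import defaultdict


def loose_key(name):
    """Keep only letters and digits; spaces, hyphens and underscores drop out.

    Case folding is an assumption of this sketch.
    """
    return "".join(ch for ch in name if ch.isalnum()).casefold()


def loose_equal(a, b):
    """True when two names match under loose matching."""
    return loose_key(a) == loose_key(b)


def find_collisions(aliases):
    """Return groups of distinct aliases that share one loose key.

    An empty result means the alias set is safe for loose matching.
    """
    groups = defaultdict(set)
    for alias in aliases:
        groups[loose_key(alias)].add(alias)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For instance, loose_equal("General_Category", "general category") holds, while find_collisions applied to a full alias list is the kind of check that vetting would amount to.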

It is useful to do this for Unicode character names as well, for environments like regular expressions. There are currently only three cases where loose matching fails, because the two names in each pair differ only by a hyphen:

U+0F60 TIBETAN LETTER -A and U+0F68 TIBETAN LETTER A
U+0FB0 TIBETAN SUBJOINED LETTER -A and U+0FB8 TIBETAN SUBJOINED LETTER A
U+116C HANGUL JUNGSEONG OE and U+1180 HANGUL JUNGSEONG O-E

With such a limited number of exceptions, one can still match loosely by special-casing them. This is done in the following way:

As it turns out, the match can even be slightly looser than with property aliases: one can also remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT", and still not have collisions.

To keep loose matching easy in the future, I recommend that the UTC and WG2 adopt the following policy: