Re: Loose Name Matching
From: Mark Davis
Date: 2002-10-30

For property names, we recommend loose string matching: only letters (and possibly digits) are taken into account when matching. In particular, spaces and hyphens are disregarded in loose matching. The property aliases are vetted to make sure that this does not cause collisions; that is, the aliases will always remain distinct even if only letters and digits are considered in matching.
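
As a rough illustration of the idea, here is a minimal Python sketch. The function names (loose_key, loose_equal, find_collisions) are hypothetical, and folding case is an assumption on my part; the text above only says that non-letters and non-digits are disregarded.

```python
from collections import defaultdict


def loose_key(name: str) -> str:
    """Reduce a name to its letters and digits only.

    Case folding is an assumption made for this sketch.
    """
    return "".join(ch for ch in name if ch.isalnum()).lower()


def loose_equal(a: str, b: str) -> bool:
    """True if two names match once spaces, hyphens, underscores, etc. are ignored."""
    return loose_key(a) == loose_key(b)


def find_collisions(aliases):
    """Group aliases by loose key and return only the keys shared by more than one alias.

    An empty result means the alias set is safe for loose matching.
    """
    groups = defaultdict(set)
    for alias in aliases:
        groups[loose_key(alias)].add(alias)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

A vetting step along these lines would run find_collisions over the full alias list whenever an alias is added, and reject any addition that produces a non-empty result.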

It is useful to do this for Unicode character names as well, for environments like regular expressions. There are currently only three cases where loose matching fails:

With such a limited number of exceptions, one can still match loosely, by special-casing these exceptions. This can be done by setting up the lookup table to exclude U+0F60, U+0FB0, and U+1180, so that there is just one code point per transformed input string. Then when matching, the following process is used:

As it turns out, the match can even be slightly looser than with property aliases: one can also remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT", and still not have collisions.
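
The special-casing and the looser key described above might be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the exact process: the name table is a small hand-picked subset rather than the full name list, all identifiers (NAMES, HYPHENATED, SPECIAL, loose_name_key, lookup) are hypothetical, the ignorable words are removed as substrings after non-alphanumerics are stripped, and the hyphen test on the raw input is my own guess at a workable rule.

```python
import re

# Small illustrative subset of character names (code point -> name).
NAMES = {
    0x0F60: "TIBETAN LETTER -A",
    0x0F68: "TIBETAN LETTER A",
    0x0FB0: "TIBETAN SUBJOINED LETTER -A",
    0x0FB8: "TIBETAN SUBJOINED LETTER A",
    0x116C: "HANGUL JUNGSEONG OE",
    0x1180: "HANGUL JUNGSEONG O-E",
    0x0041: "LATIN CAPITAL LETTER A",
    0x0031: "DIGIT ONE",
}

# Code points left out of the table -> the counterpart that stays in it.
HYPHENATED = {0x0F60: 0x0F68, 0x0FB0: 0x0FB8, 0x1180: 0x116C}

# Letter sequences that can also be dropped, per the text above.
IGNORABLE = ("LETTER", "CHARACTER", "DIGIT")


def loose_name_key(name: str) -> str:
    """Keep letters and digits only, upper-cased, then drop the ignorable sequences."""
    key = "".join(ch for ch in name.upper() if ch.isalnum())
    for word in IGNORABLE:
        key = key.replace(word, "")
    return key


# One code point per transformed string: the hyphenated names are excluded.
TABLE = {loose_name_key(n): cp for cp, n in NAMES.items() if cp not in HYPHENATED}

# Kept code point -> (excluded code point, pattern that signals the hyphenated
# form in the raw input). The patterns are an assumption of this sketch.
SPECIAL = {
    0x0F68: (0x0F60, re.compile(r"-\s*A\s*$", re.IGNORECASE)),
    0x0FB8: (0x0FB0, re.compile(r"-\s*A\s*$", re.IGNORECASE)),
    0x116C: (0x1180, re.compile(r"O\s*-\s*E\s*$", re.IGNORECASE)),
}


def lookup(raw: str):
    """Return the code point loosely matching raw, or None if there is no match."""
    cp = TABLE.get(loose_name_key(raw))
    if cp in SPECIAL:
        alt, pattern = SPECIAL[cp]
        # Only for the three exceptions do we look back at the untransformed input.
        if pattern.search(raw):
            return alt
    return cp
```

With this table, lookup("tibetan letter a") and lookup("Tibetan Letter -A") resolve to different code points, while every other name needs only the single table probe. One limitation of the hyphen patterns chosen here: an input such as "tibetan-letter-a" is read as the hyphenated character, so a stricter rule may be wanted in practice.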

To keep loose matching easy in the future, I recommend that the UTC and WG2 adopt the following policy: