Re: property, character, and sequence name loose matching

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 09 2010 - 19:37:23 CST

Next message: CE Whitehead: "RE: Arabic aleph representation of glyphs"

Previous message: Kenneth Whistler: "Re: Policy on character name aliases?"
Maybe in reply to: karl williamson: "property, character, and sequence name loose matching"
Next in thread: Andrew West: "Re: property, character, and sequence name loose matching"
Reply: Andrew West: "Re: property, character, and sequence name loose matching"
Reply: karl williamson: "Re: property, character, and sequence name loose matching"

Karl Williamson asked:

> The loose matching rules in TR18 say to ignore white space, underscores,
> and hyphens. That means that someone could insert white space into the
> middle of what is supposed to be a single word, like
> \p{s c r i p t: greek}. Same for character names.

Actually, it doesn't mean that you can arbitrarily ignore
the identifier syntax of particular formalizations.

What it means is that if you are matching particular
property values from the Unicode Character Database,
then such strings as "right above", "right_above" and "rightabove"
(as well as case permutations such as "Right Above", "RIGHT_ABOVE",
etc.) should all be considered as matching each other.
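As a rough illustration of that kind of comparison, here is a minimal Python sketch. The helper names `loose_value_key` and `loose_value_equal` are made up for this example, and it covers only the folding described above (case, white space, underscores, hyphens), not everything a full implementation of the TR18 rules might need:

```python
import re

def loose_value_key(value: str) -> str:
    """Fold a property value for loose comparison.

    Drops white space, underscores and hyphens, then ignores case.
    (Hypothetical helper; a sketch, not a complete rule set.)
    """
    return re.sub(r"[\s_\-]", "", value).lower()

def loose_value_equal(a: str, b: str) -> bool:
    """True if two property value strings match loosely."""
    return loose_value_key(a) == loose_value_key(b)

for candidate in ("right above", "right_above", "rightabove",
                  "Right Above", "RIGHT_ABOVE"):
    print(candidate, loose_value_equal(candidate, "Right_Above"))
```

Note that the folding is applied to the property value string once it has been picked out; it says nothing about how the surrounding \p{...} syntax is tokenized.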

> Someone has pointed out to me that UAX34 says this: "Like character
> names, names for sequences are unique if they are different even when
> SPACE and medial HYPHEN-MINUS characters are ignored". The term
> "medial" isn't in TR18. That same someone pointed out that if you can
> have spaces between characters in a word, that means the concept of
> "medial" is meaningless.

If you assume counterfactual premises, you can prove anything
to be meaningless.

>
> Please explain what was meant.

What it means is that such names as:

CHARACTER BZZT
CHARACTER B-ZZ-T
CHARACTER BZ-ZT

would be considered matches. And because they are matches
by the loose matching rules for names and named sequences,
the UTC is careful to ensure that different characters are
not given such names, precisely because they are not considered
distinct.

By contrast, such names as:

CHARACTER BZZT
CHARACTER BZZT-
CHARACTER -BZZT

would *NOT* be considered matches. So in principle it would
be possible to have three different characters encoded with
those three names.

In practice the UTC doesn't actually use names like those,
but there are a few Tibetan naming conventions that slipped
in early on -- which is the reason for allowing non-medial hyphens
in names (and keeping them distinct). To wit:

U+0F60 TIBETAN LETTER -A
U+0F68 TIBETAN LETTER A

Those do *not* match.

On the other hand, there is an exception written into the name
matching rule because of some Korean Hangul characters. In
particular:

U+116C HANGUL JUNGSEONG OE
U+1180 HANGUL JUNGSEONG O-E

also do *not* match. But in that case, it is a matter of
particular exception, rather than general rule.
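The name cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `loose_name_key` and `loose_name_equal` are invented here, "medial" is taken to mean a hyphen sitting directly between two letters or digits in the name as written (that is, judged before any spaces are dropped), and underscores are treated like spaces.

```python
import re

# Assumption: a medial hyphen has a letter or digit on both sides.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")

def loose_name_key(name: str) -> str:
    """Fold a character or sequence name for loose comparison (sketch)."""
    # Ignore case; treat underscores like spaces.
    text = name.upper().replace("_", " ")
    # Hyphens are kept here so the Hangul exception can be detected.
    squeezed = re.sub(r"\s+", "", text)
    # Drop medial hyphens first, then white space.
    key = re.sub(r"\s+", "", _MEDIAL_HYPHEN.sub("", text))
    # Particular exception: the hyphen in U+1180's name is significant.
    if squeezed == "HANGULJUNGSEONGO-E":
        return squeezed
    return key

def loose_name_equal(a: str, b: str) -> bool:
    """True if two names match loosely."""
    return loose_name_key(a) == loose_name_key(b)

print(loose_name_equal("CHARACTER BZZT", "CHARACTER B-ZZ-T"))    # medial hyphens
print(loose_name_equal("CHARACTER BZZT", "CHARACTER BZZT-"))     # trailing hyphen
print(loose_name_equal("TIBETAN LETTER -A", "TIBETAN LETTER A")) # leading hyphen
print(loose_name_equal("HANGUL JUNGSEONG OE", "HANGUL JUNGSEONG O-E"))
```

The order of the steps matters in this sketch: whether a hyphen is medial is decided while the spaces are still present, which is why a hyphen preceded by a space (as in U+0F60) is kept.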

--Ken


This archive was generated by hypermail 2.1.5 : Tue Mar 09 2010 - 19:40:11 CST