Re: property, character, and sequence name loose matching

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 09 2010 - 19:37:23 CST

    Karl Williamson asked:

    > The loose matching rules in TR18 say to ignore white space, underscores,
    > and hyphens. That means that someone could insert white space into the
    > middle of what is supposed to be a single word, like
    > \p{s c r i p t: greek}. Same for character names.

    Actually, it doesn't mean that you can arbitrarily ignore
    the identifier syntax of particular formalizations.

    What it means is that if you are matching particular
    property values from the Unicode Character Database,
    then such strings as "right above", "right_above" and "rightabove"
    (as well as case permutations such as "Right Above", "RIGHT_ABOVE",
    etc.) should all be considered as matching each other.
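
    As a rough illustration of that kind of comparison, here is a minimal
    Python sketch. The helper name loose_key is hypothetical, and it covers
    only the folding described above (case, white space, underscores,
    hyphens), not every detail of the published matching rules.

```python
import re

def loose_key(value: str) -> str:
    """Fold a property value for comparison.

    Drops white space, underscores, and hyphens, then ignores case.
    (Hypothetical helper; a sketch, not a complete implementation.)
    """
    return re.sub(r"[\s_-]+", "", value).casefold()

forms = ["right above", "right_above", "rightabove", "Right Above", "RIGHT_ABOVE"]

# All five spellings fold to the same key, so they match each other.
assert len({loose_key(f) for f in forms}) == 1
```

    Note that this folds a property value after it has been isolated; it
    says nothing about the surrounding \p{...} syntax.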

    > Someone has pointed out to me that UAX34 says this: "Like character
    > names, names for sequences are unique if they are different even when
    > SPACE and medial HYPHEN-MINUS characters are ignored". The term
    > "medial" isn't in TR18. That same someone pointed out that if you can
    > have spaces between characters in a word, that means the concept of
    > "medial" is meaningless.

    If you assume counterfactual premises, you can prove anything
    to be meaningless.

    >
    > Please explain what was meant.

    What it means is that such names as:

    CHARACTER BZZT
    CHARACTER B-ZZ-T
    CHARACTER BZ-ZT

    would be considered matches. And because they are matches
    by the loose matching rules for names and named sequences,
    the UTC is careful to ensure that different characters are
    not given such names, precisely because they are not considered
    distinct.

    By contrast, such names as:

    CHARACTER BZZT
    CHARACTER BZZT-
    CHARACTER -BZZT

    would *NOT* be considered matches. So in principle it would
    be possible to have three different characters encoded with
    those three names.
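
    The medial/non-medial distinction can be sketched in Python as below.
    The helper name name_key is hypothetical, and the sketch assumes that
    "medial" means a hyphen sitting directly between two letters or digits;
    medial hyphens are dropped before spaces are, so a hyphen next to a
    space or at the end of the name survives.

```python
import re

# Assumption: a medial hyphen has a letter or digit on both sides.
MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")

def name_key(name: str) -> str:
    """Fold a character name: drop medial hyphens, then spaces and
    underscores, then ignore case. (Hypothetical helper, sketch only.)"""
    without_medial = MEDIAL_HYPHEN.sub("", name)
    return re.sub(r"[\s_]+", "", without_medial).upper()

# Medial hyphens are ignored, so these three names match.
matching = ["CHARACTER BZZT", "CHARACTER B-ZZ-T", "CHARACTER BZ-ZT"]
assert len({name_key(n) for n in matching}) == 1

# Leading and trailing hyphens are kept, so these three stay distinct.
distinct = ["CHARACTER BZZT", "CHARACTER BZZT-", "CHARACTER -BZZT"]
assert len({name_key(n) for n in distinct}) == 3
```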

    In practice the UTC doesn't actually use names like those,
    but there are a few Tibetan naming conventions that slipped
    in early on -- which is the reason for allowing non-medial hyphens
    in names (and keeping them distinct). To wit:

    U+0F60 TIBETAN LETTER -A
    U+0F68 TIBETAN LETTER A

    Those do *not* match.

    On the other hand, there is an exception written into the name
    matching rule because of some Korean Hangul characters. In
    particular:

    U+116C HANGUL JUNGSEONG OE
    U+1180 HANGUL JUNGSEONG O-E

    also do *not* match. But in that case, it is a matter of
    particular exception, rather than general rule.
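
    One way to picture that exception is as a special case in front of the
    general folding. The Python sketch below is an assumption about how an
    implementation might do it, not a description of any particular
    library; the names name_key_with_exception and KEPT_HYPHEN_NAME are
    hypothetical.

```python
import re

# Assumption: a medial hyphen has a letter or digit on both sides.
MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")

# The one name whose medial hyphen must not be ignored (U+1180).
KEPT_HYPHEN_NAME = "HANGUL JUNGSEONG O-E"

def name_key_with_exception(name: str) -> str:
    """Fold a character name, keeping the hyphen of U+1180.
    (Hypothetical helper, sketch only.)"""
    # Normalize case, underscores, and runs of white space first.
    normalized = " ".join(name.upper().replace("_", " ").split())
    if normalized == KEPT_HYPHEN_NAME:
        # Special case: keep the hyphen, drop only the spaces.
        return normalized.replace(" ", "")
    return re.sub(r"\s+", "", MEDIAL_HYPHEN.sub("", normalized))

# The two Hangul names stay distinct despite the medial hyphen.
assert (name_key_with_exception("HANGUL JUNGSEONG OE")
        != name_key_with_exception("HANGUL JUNGSEONG O-E"))

# Other medial hyphens are still ignored.
assert (name_key_with_exception("CHARACTER B-ZZ-T")
        == name_key_with_exception("CHARACTER BZZT"))
```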

    --Ken


