Re: property, character, and sequence name loose matching

From: karl williamson (
Date: Tue Mar 16 2010 - 21:18:13 CST

    I haven't had time to mull this over, but below is a cut and paste of
    the salient portion of Section 2.5 of TR18.

    Kenneth Whistler wrote:
    > Asmus (responding to Karl Williamson) noted:
    >> Fine, you've made your point that
    >> /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all
    >> medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >> * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
    >> "zerowidthspace"
    >> * "character -a" is /not/ equivalent to "character a"
    >> could be improved to note the interaction between the presence/absence
    >> of spaces and "medial". (I believe that's actually in the works).
    > Indeed:
    > for the Unicode 6.0 proposed update draft. Now is everybody's
    > chance to comment if anything about that clarification is still
    > problematical.
    >>> As an aside, it has been my experience that ignoring all white space
    >>> usually leads to unintended negative consequences. The 1966 ANSI
    >>> Fortran standard suffered from this (I don't know about later
    >>> versions), and it led to problems, with economic consequences. It is
    >>> a pity that this lesson did not get passed on to later generations. I
    >>> doubt that Unicode really wants 'S c r i p t' to mean 'Script', but
    >>> that's what it says. It would have been better in my opinion for it
    >>> to say that multiple white space is equivalent to a single white space.
    >> That's a good point, even though you misrepresent the intention of
    >> Unicode. Of interest here is not the folding of multiple spaces into one
    >> as much as allowing CamelCase version of names (instead of UPPER CASE or
    >> lower case with spaces).
    >> At the same time, there are some names, esp. character names, where
    >> users might disagree about where to add spaces. It was felt useful to
    >> allow the use not only of fewer spaces, but also of more spaces than the
    >> formal name.
    > There are examples such as U+003C LESS-THAN SIGN, where one would
    > want what might be a fairly common spelling, "less than sign",
    > to match the formal name "LESS-THAN SIGN".
    > In Hangul jamo letter names like U+112C HANGUL CHOSEONG
    > KAPYEOUNSSANGPIEUP, the last part is actually four syllables,
    > KAP YEOUN SSANG PIEUP, and you might not know where somebody
    > would or would not add spaces -- or hyphens, for that matter.
    > No one would *really* know where they should put hyphens or
    > spaces in U+238F OPEN-CIRCUIT-OUTPUT H-TYPE SYMBOL without looking
    > it up in the names list. ;-)
    > U+269C FLEUR-DE-LIS uses the *English* spelling which in most
    > dictionaries shows hyphens, but a French speaker would be more
    > likely to use "fleur de lis" without hyphens, since that is
    > the French spelling.
    > The point of a loose matching rule for character names like
    > this is to capture reasonable expectations about what people
    > might want to do in contexts like identifier, label, or
    > presentation, and still successfully match the
    > intended character.
    > And the standardization committees (UTC and WG2) are aware of
    > the loose matching rule for character names, and check against
    > it when creating new character names, so as not to introduce
    > character names that would be ambiguous under that loose
    > matching rule.
    >>> This is a false analogy because Unicode has never said that 'S' is to be
    >>> ignored in loose matching. Unicode still says (in TR18) that all
    >>> hyphens (except in 3 cases) are to be ignored. If hyphens can be
    >>> significant parts of character names, Unicode should never have said
    >>> they effectively aren't.
    >> UTS 18 is formally a different standard than the Unicode Standard (TUS)
    >> (which incorporates UAX#44).
    >> In this case, you are correct, UTS#18 is in conflict with UAX#44 (and
    >> therefore TUS). The three cases may have been the only cases where
    >> hyphens resulted in a distinct name at the time UTS#18 was drafted, but
    >> it's clear that this approach is not robust, as long as UTC can add
    >> additional names under the slightly different rules of UAX#44.
    >> That should result in a correction/corrigendum for UTS#18.
    > Sorry, but I'm not seeing it.
    > The conformance requirements for claiming a level of
    > conformance in UTS #18, RL 1.5 Simple Loose Matches,
    > and RL 2.4 Default Loose Matches, have only to do with
    > case-insensitive matching for generic text, and do not
    > involve ignoring of whitespace, hyphens or underscores.
    > The only mention of loose matching of the type that we
    > are talking about is in Section 1.2 Properties, where it is
    > referring specifically to *property* names and values. And
    > there it is couched as a recommendation -- not a conformance
    > requirement:
    > "It is strongly recommended that both [long and short] property
    > names be recognized, and that loose matching of property names
    > be used, whereby the case distinctions, whitespace, hyphens,
    > and underbar are ignored."

    Section 2.5

    "As with other property values, names should use a loose match,
    disregarding case, spaces and hyphen (the underbar character "_" cannot
    occur in Unicode character names). An implementation may also choose to
    allow namespaces, where some prefix like "LATIN LETTER" is set globally
    and used if there is no match otherwise.

    There are, however, three instances that require special-casing with
    loose matching, where an extra test shall be made for the presence or
    absence of a hyphen.

         * U+0F68 TIBETAN LETTER A and
           U+0F60 TIBETAN LETTER -A
         * U+0FB8 TIBETAN SUBJOINED LETTER A and
           U+0FB0 TIBETAN SUBJOINED LETTER -A
         * U+116C HANGUL JUNGSEONG OE and
           U+1180 HANGUL JUNGSEONG O-E"
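
To make the difference between the two wordings concrete, here is a rough Python sketch, not taken from either document. It compares "ignore every hyphen" (the TR18 wording above, before its special cases are applied) with "ignore only medial hyphens" (the UAX44-LM2 wording quoted earlier in the thread). The function names `naive_key` and `lm2_key` are made up for illustration. The sketch assumes that "medial" means a hyphen with a letter or digit immediately on both sides, judged before any spaces are removed.

```python
import re

# Assumption: a "medial" hyphen has a letter or digit directly on both sides,
# and this is decided on the name as typed, before spaces are stripped.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")


def naive_key(name):
    """Ignore case, whitespace, underscores and *every* hyphen."""
    return re.sub(r"[\s_-]+", "", name.upper())


def lm2_key(name):
    """Ignore case, whitespace, underscores and medial hyphens only."""
    upper = name.upper()
    # Same string with only spaces/underscores removed (hyphens kept).
    compact = _SPACE_OR_UNDERSCORE.sub("", upper)
    key = _SPACE_OR_UNDERSCORE.sub("", _MEDIAL_HYPHEN.sub("", upper))
    # The one medial hyphen that must survive: U+1180 HANGUL JUNGSEONG O-E.
    if key == "HANGULJUNGSEONGOE" and compact.endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


# The naive rule collapses the Tibetan pair; the medial-only rule does not.
print(naive_key("TIBETAN LETTER -A") == naive_key("TIBETAN LETTER A"))
print(lm2_key("TIBETAN LETTER -A") == lm2_key("TIBETAN LETTER A"))
```

With the medial-only rule the leading hyphen in "-A" is kept because a space precedes it, so only the Hangul pair still needs an explicit test.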

    > And as Asmus pointed out in an earlier note in this thread,
    > property names (or more exactly property aliases and
    > property value aliases) follow a different pattern than
    > character names. They are unambiguously interpretable if
    > you ignore all "case distinctions, whitespace, hyphens,
    > and underbar", because there are no funky edge cases
    > involving medial hyphens for those. In fact there are no
    > space characters whatsoever in any of the normative property
    > aliases or property value aliases in the Unicode Character
    > Database. And if somebody sticks a space (or spaces)
    > in a regex expression for something like \p{General Category:Lm}
    > instead of using \p{gc:Lm}, well, then the kindly (and
    > reasonable) thing for the regex engine to do would be
    > to ignore that space, as it is more likely to get the
    > expected result than it would by throwing a syntax exception.
    > The applicable loose matching rule in this case is not
    > the character names loose matching rule (UAX44-LM2), but
    > rather the symbolic values loose matching rule (UAX44-LM3).
    > Now granted this hasn't been spelled out explicitly in
    > the standard all that long -- the elaborations in UAX #44
    > are of fairly recent provenance. But this was nevertheless
    > the clear intent of the property alias files all along,
    > since they first were published as part of the UCD.
    > --Ken
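
For property names the rule is simpler, since no hyphen is ever significant. A minimal Python sketch of the behaviour Ken describes follows. It implements only what the quoted recommendation says (ignore case, whitespace, hyphens and underbar). The names `lm3_key` and `parse_property_expr` and the two tiny alias tables are hypothetical; a real implementation would build its tables from the UCD property alias files.

```python
import re


def lm3_key(symbol):
    """Ignore case, whitespace, hyphens and underscores in a symbolic value."""
    return re.sub(r"[\s_-]+", "", symbol).lower()


# Illustrative tables only: one property and one of its values.
PROPERTY_ALIASES = {lm3_key(a): "gc" for a in ("gc", "General_Category")}
GC_VALUE_ALIASES = {lm3_key(a): "Lm" for a in ("Lm", "Modifier_Letter")}


def parse_property_expr(expr):
    # Resolve the text inside the braces of a property escape,
    # e.g. "General Category:Lm" -> ("gc", "Lm").
    name, _, value = expr.partition(":")
    return PROPERTY_ALIASES[lm3_key(name)], GC_VALUE_ALIASES[lm3_key(value)]


print(parse_property_expr("General Category:Lm"))
print(parse_property_expr("gc:Lm"))
```

Both calls resolve to the same property and value, so a stray space in the long name gives the expected result instead of a syntax error.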
