Re: property, character, and sequence name loose matching

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 16 2010 - 17:57:48 CST

    Asmus (responding to Karl Williamson) noted:

    > Fine, you've made your point that
    >
    > /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all
    > medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >
    > * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
    > "zerowidthspace"
    > * "character -a" is /not/ equivalent to "character a"
    >
    > could be improved to note the interaction between the presence/absence
    > of spaces and "medial". (I believe that's actually in the works).

    Indeed:

    http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules

    for the Unicode 6.0 proposed update draft. Now is everybody's
    chance to comment if anything about that clarification is still
    problematical.

    > > As an aside, it has been my experience that ignoring all white space
    > > usually leads to unintended negative consequences. The 1966 ANSI
    > > Fortran standard suffered from this (I don't know about later
    > > versions), and it led to problems, with economic consequences. It is
    > > a pity that this lesson did not get passed on to later generations. I
    > > doubt that Unicode really wants 'S c r i p t' to mean 'Script', but
    > > that's what it says. It would have been better in my opinion for it
    > > to say that multiple white space is equivalent to a single white space.
    >
    > That's a good point, even though you misrepresent the intention of
    > Unicode. Of interest here is not the folding of multiple spaces into one
    > as much as allowing CamelCase version of names (instead of UPPER CASE or
    > lower case with spaces).
    >
    > At the same time, there are some names, esp. character names, where
    > users might disagree about where to add spaces. It was felt useful to
    > allow the use not only of fewer spaces, but also of more spaces than the
    > formal name.

    There are examples such as U+003C LESS-THAN SIGN, where one would
    want what might be a fairly common spelling, "less than sign",
    to match the formal name "LESS-THAN SIGN".

    In Hangul jamo letter names like U+112C HANGUL CHOSEONG
    KAPYEOUNSSANGPIEUP, the last part is actually four syllables,
    KAP YEOUN SSANG PIEUP, and you might not know where somebody
    would or would not add spaces -- or hyphens, for that matter.

    No one would *really* know where they should put hyphens or
    spaces in U+238F OPEN-CIRCUIT-OUTPUT H-TYPE SYMBOL without looking
    it up in the names list. ;-)

    U+269C FLEUR-DE-LIS uses the *English* spelling which in most
    dictionaries shows hyphens, but a French speaker would be more
    likely to use "fleur de lis" without hyphens, since that is
    the French spelling.

    The point of a loose matching rule for character names like
    this is to capture reasonable expectations about what people
    might want to do in contexts like identifier, label, or
    presentation, and still successfully match the
    intended character.
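
As an illustration of the rule quoted above, here is a minimal Python sketch of UAX44-LM2 as this thread describes it: ignore case, whitespace, and underscores, and drop a hyphen only when it is medial, that is, when it sits directly between two letters or digits. The function name `loose_name_key` and the way the U+1180 exception is detected are assumptions made for the example, not anything defined by the standard.

```python
import re

# A hyphen is treated as "medial" only when a letter or digit sits
# immediately on both sides of it, before any whitespace is removed.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")


def loose_name_key(name: str) -> str:
    """Return a comparison key for a character name (hypothetical helper)."""
    upper = name.upper()
    key = _SPACE_OR_UNDERSCORE.sub("", _MEDIAL_HYPHEN.sub("", upper))
    # Exception: the hyphen in U+1180 HANGUL JUNGSEONG O-E is significant,
    # because U+116C HANGUL JUNGSEONG OE would otherwise get the same key.
    if key == "HANGULJUNGSEONGOE" and upper.strip().endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


def loose_name_match(a: str, b: str) -> bool:
    """True if two spellings name the same character under this sketch."""
    return loose_name_key(a) == loose_name_key(b)
```

Under this sketch, `loose_name_match("less than sign", "LESS-THAN SIGN")` and `loose_name_match("fleur de lis", "FLEUR-DE-LIS")` both hold, while "character -a" and "character a" stay distinct because the hyphen follows a space and is therefore not medial.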

    And the standardization committees (UTC and WG2) are aware of
    the loose matching rule for character names, and check against
    it when creating new character names, so as not to introduce
    character names that would be ambiguous under that loose
    matching rule.
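
To make that kind of check concrete, the following sketch groups names by a comparison key and reports any group with more than one member. It is only an illustration: `find_collisions` and `naive_key` are hypothetical names, the name list is a hand-picked sample rather than the full names list, and nothing here is claimed to be the committees' actual tooling. The key used is deliberately naive (it drops every hyphen), to show which names would become ambiguous without the medial-hyphen restriction.

```python
import re
from collections import defaultdict


def naive_key(name: str) -> str:
    """Deliberately naive key: drop case, whitespace, underscores, ALL hyphens."""
    return re.sub(r"[\s_\-]+", "", name).upper()


def find_collisions(names, key):
    """Return the groups of names that share a key (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hand-picked sample of formal character names.
SAMPLE_NAMES = [
    "HANGUL JUNGSEONG OE",   # U+116C
    "HANGUL JUNGSEONG O-E",  # U+1180
    "TIBETAN LETTER A",      # U+0F68
    "TIBETAN LETTER -A",     # U+0F60
    "LESS-THAN SIGN",        # U+003C
    "FLEUR-DE-LIS",          # U+269C
]

collisions = find_collisions(SAMPLE_NAMES, naive_key)
for group in collisions:
    print("ambiguous under naive key:", group)
```

With the naive key, the Hangul pair and the Tibetan pair each collapse to a single key. A key that keeps non-medial hyphens and special-cases U+1180 separates both pairs again.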

    > > This is a false analogy because Unicode has never said that 'S' is to be
    > > ignored in loose matching. Unicode still says (in TR18) that all
    > > hyphens (except in 3 cases) are to be ignored. If hyphens can be
    > > significant parts of character names, Unicode should never have said
    > > they effectively aren't.
    >
    > UTS 18 is formally a different standard than the Unicode Standard (TUS)
    > (which incorporates UAX#44).
    > In this case, you are correct, UTS#18 is in conflict with UAX#44 (and
    > therefore TUS). The three cases may have been the only cases where
    > hyphens resulted in a distinct name at the time UTS#18 was drafted, but
    > it's clear that this approach is not robust, as long as UTC can add
    > additional names under the slightly different rules of UAX#44.
    >
    > That should result in a correction/corrigendum for UTS#18.

    Sorry, but I'm not seeing it.

    The conformance requirements for claiming a level of
    conformance in UTS #18, RL 1.5 Simple Loose Matches,
    and RL 2.4 Default Loose Matches, have only to do with
    case-insensitive matching for generic text, and do not
    involve ignoring of whitespace, hyphens or underscores.

    The only mention of loose matching of the type that we
    are talking about is in Section 1.2 Properties, where it is
    referring specifically to *property* names and values. And
    there it is couched as a recommendation -- not a conformance
    requirement:

    "It is strongly recommended that both [long and short] property
    names be recognized, and that loose matching of property names
    be used, whereby the case distinctions, whitespace, hyphens,
    and underbar are ignored."

    And as Asmus pointed out in an earlier note in this thread,
    property names (or more exactly property aliases and
    property value aliases) follow a different pattern than
    character names. They are unambiguously interpretable if
    you ignore all "case distinctions, whitespace, hyphens,
    and underbar", because there are no funky edge cases
    involving medial hyphens for those. In fact there are no
    space characters whatsoever in any of the normative property
    aliases or property value aliases in the Unicode Character
    Database. And if somebody sticks a space (or spaces)
    in a regular expression for something like \p{General Category:Lm}
    instead of using \p{gc:Lm}, well, then the kindly (and
    reasonable) thing for the regex engine to do would be
    to ignore that space, as doing so is more likely to produce the
    expected result than throwing a syntax exception would.

    The applicable loose matching rule in this case is not
    the character names loose matching rule (UAX44-LM2), but
    rather the symbolic values loose matching rule (UAX44-LM3).
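
For comparison, a minimal sketch of loose matching for property names, following the UTS #18 wording quoted above (case, whitespace, hyphens, and underbar are all ignored). The helper names and the two-entry alias table are assumptions for illustration; a real implementation would load the full set of aliases from the UCD property alias files.

```python
import re

_IGNORED = re.compile(r"[\s_\-]+")


def loose_symbol_key(symbol: str) -> str:
    """Comparison key that ignores case, whitespace, hyphens, underscores."""
    return _IGNORED.sub("", symbol).lower()


# Tiny illustrative subset: short alias -> long property name.
PROPERTY_ALIASES = {
    "gc": "General_Category",
    "sc": "Script",
}

# Index both the short and the long form of each property by loose key.
_LOOKUP = {}
for short, long_name in PROPERTY_ALIASES.items():
    _LOOKUP[loose_symbol_key(short)] = long_name
    _LOOKUP[loose_symbol_key(long_name)] = long_name


def resolve_property(name: str):
    """Return the long property name for a loosely spelled one, else None."""
    return _LOOKUP.get(loose_symbol_key(name))
```

Here `resolve_property("General Category")` and `resolve_property("gc")` both give `"General_Category"`, and so does `"general-category"`. Because every hyphen is dropped, no medial-hyphen logic is needed, which is only safe because the property aliases have no names that differ solely by a hyphen.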

    Now granted this hasn't been spelled out explicitly in
    the standard all that long -- the elaborations in UAX #44
    are of fairly recent provenance. But this was nevertheless
    the clear intent of the property alias files all along,
    since they first were published as part of the UCD.

    --Ken


