Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Sat Mar 20 2010 - 12:05:12 CST

    Kenneth Whistler wrote:
    > Asmus (responding to Karl Williamson) noted:
    >
    >> Fine, you've made your point that

    I'm sorry if I belabored my points; I didn't realize that they had sunk
    in, so was restating in simpler terms.

    >>
    >> /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all
    >> medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >>
    >> * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
    >> "zerowidthspace"
    >> * "character -a" is /not/ equivalent to "character a"
    >>
    >> could be improved to note the interaction between the presence/absence
    >> of spaces and "medial". (I believe that's actually in the works).
    >
    > Indeed:
    >
    > http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
    >
    > for the Unicode 6.0 proposed update draft. Now is everybody's
    > chance to comment if anything about that clarification is still
    > problematical.

    It still seems wrong to me.

    First, a question. What does "whitespace" mean? Is it more than the
    SPACE character?

    I see two possible directions to go in implementing this to best get the
    user's intent.

    One is to say that input white space is significant adjacent to hyphens,
    and to ignore input medial hyphens. But this doesn't get the best
    results. For example, if the user inputs \N{TIBETAN LETTER-A} (probably
    meaning U+0F60, \N{TIBETAN LETTER -A}), it would get parsed instead as
    \N{TIBETAN LETTER A} (= U+0F68).
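
    A minimal sketch of this first approach, to make the failure concrete.
    The function name fold_approach_one and the treatment of underscores as
    spaces are assumptions made for illustration; they do not come from any
    Unicode document.

```python
import re

def fold_approach_one(name: str) -> str:
    """Hypothetical folding for the first approach: medial hyphens are
    ignored, and spaces are kept only where they touch a hyphen."""
    # Normalize case; treat runs of whitespace or underscore as one space
    # (the underscore handling is an assumption borrowed from UAX44-LM2).
    s = re.sub(r"[\s_]+", " ", name.strip()).upper()
    # Drop hyphens that have a letter or digit on both sides in the input.
    s = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", s)
    # Drop spaces that are not adjacent to a surviving hyphen.
    s = re.sub(r"(?<!-) (?!-)", "", s)
    return s

# The mistyped name folds to the same key as the wrong character's name.
print(fold_approach_one("TIBETAN LETTER-A") == fold_approach_one("TIBETAN LETTER A"))   # True
# The correctly spaced name stays distinct.
print(fold_approach_one("TIBETAN LETTER -A") == fold_approach_one("TIBETAN LETTER A"))  # False
```

    Under these assumptions the input with the space omitted silently
    resolves to U+0F68 rather than U+0F60.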

    The other way is to keep a list of the problematic code points to handle
    specially, of which the above is one (the implementation could either
    consider this an error as being ambiguous, or interpret it as U+0F60).
    This is the approach advocated in TR18, and as far as I know, its
    current list of three is correct for TUS 5.2 and earlier. The problem
    with this approach, as Asmus has pointed out, is that the list can
    change in future Unicode versions. As an implementor, if I know that
    the list can change, I can write code that recalculates the list for
    each new Unicode release. I have no problem with that. Or if Unicode
    were good about listing all the gotcha changes for a new release, I
    could keep the list by hand, and add to it when necessary.
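
    Recalculating the list could look roughly like the sketch below. It
    uses Python's standard unicodedata module, so the result reflects
    whatever Unicode version that Python build ships (see
    unicodedata.unidata_version), not necessarily 5.2. The helper names
    aggressive_fold and colliding_names are made up for this example.

```python
import sys
import unicodedata
from collections import defaultdict

def aggressive_fold(name: str) -> str:
    """Ignore case, spaces, underscores, and every hyphen."""
    return "".join(ch for ch in name.upper() if ch not in " _-")

def colliding_names() -> dict:
    """Map each folded name shared by two or more code points to those
    code points. These are the names that need special handling."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if name:
            groups[aggressive_fold(name)].append(cp)
    return {key: cps for key, cps in groups.items() if len(cps) > 1}

print("Unicode version:", unicodedata.unidata_version)
for folded, cps in sorted(colliding_names().items()):
    print(folded, [f"U+{cp:04X} {unicodedata.name(chr(cp))}" for cp in cps])
```

    Rerunning this against each new release would show whether the list of
    problematic names has grown.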

    It seems to me that the new proposed wording of UAX44 could be saying
    kind of the reverse of this second approach, as it uses not the
    problematic cases, but the normal. That is, it defines a medial hyphen
    as one occurring in the official name, thus the implementation is
    supposed to know which are those (a much longer list than the
    problematic ones). It specifically says that hyphens that become medial
    as a result of removing white space in transforming the input are not to
    be considered medial. But it doesn't mention hyphens that were input as
    medial but aren't in the official name. If I take the text literally,
    as one is supposed to in a standard, then these are to be considered
    significant, as they don't fit its definition of medial, and hence will
    cause the input name to not match anything. Thus U+112C HANGUL CHOSEONG
    KAP-YEOUN-SSANG-PIEUP (I'm guessing at the syllable boundaries) would
    not match any official name.

    I think the best solution is to go back to the TR18 approach for the
    whole standard, UAX44 and Chapter 4, but explicitly say that the list is
    subject to addition.
    >
    >>> As an aside, it has been my experience that ignoring all white space
    >>> usually leads to unintended negative consequences. The 1966 ANSI
    >>> Fortran standard suffered from this (I don't know about later
    >>> versions), and it led to problems, with economic consequences. It is
    >>> a pity that this lesson did not get passed on to later generations. I
    >>> doubt that Unicode really wants 'S c r i p t' to mean 'Script', but
    >>> that's what it says. It would have been better in my opinion for it
    >>> to say that multiple white space is equivalent to a single white space.
    >> That's a good point, even though you misrepresent the intention of
    >> Unicode. Of interest here is not the folding of multiple spaces into
    >> one as much as allowing CamelCase version of names (instead of UPPER
    >> CASE or lower case with spaces).
    >>
    >> At the same time, there are some names, esp. character names, where
    >> users might disagree about where to add spaces. It was felt useful to
    >> allow the use not only of fewer spaces, but also of more spaces than
    >> the formal name.
    >
    > There are examples such as U+003C LESS-THAN SIGN, where one would
    > want what might be a fairly common spelling, "less than sign",
    > to match the formal name "LESS-THAN SIGN".
    >
    > In Hangul jamo letter names like U+112C HANGUL CHOSEONG
    > KAPYEOUNSSANGPIEUP, the last part is actually four syllables,
    > KAP YEOUN SSANG PIEUP, and you might not know where somebody
    > would or would not add spaces -- or hyphens, for that matter.
    >
    > No one would *really* know where they should put hyphens or
    > spaces in U+238F OPEN-CIRCUIT-OUTPUT H-TYPE SYMBOL without looking
    > it up in the names list. ;-)
    >
    > U+269C FLEUR-DE-LIS uses the *English* spelling which in most
    > dictionaries shows hyphens, but a French speaker would be more
    > likely to use "fleur de lis" without hyphens, since that is
    > the French spelling.
    >
    > The point of a loose matching rule for character names like
    > this is to capture reasonable expectations about what people
    > might want to do in contexts like identifier, label, or
    > presentation, and still successfully match the
    > intended character.
    >
    > And the standardization committees (UTC and WG2) are aware of
    > the loose matching rule for character names, and check against
    > it when creating new character names, so as not to introduce
    > character names that would be ambiguous under that loose
    > matching rule.
    >
    >>> This is a false analogy because Unicode has never said that 'S' is
    >>> to be ignored in loose matching. Unicode still says (in TR18) that all
    >>> hyphens (except in 3 cases) are to be ignored. If hyphens can be
    >>> significant parts of character names, Unicode should never have said
    >>> they effectively aren't.
    >> UTS 18 is formally a different standard than the Unicode Standard (TUS)
    >> (which incorporates UAX#44).
    >> In this case, you are correct, UTS#18 is in conflict with UAX#44 (and
    >> therefore TUS). The three cases may have been the only cases where
    >> hyphens resulted in a distinct name at the time UTS#18 was drafted,
    >> but it's clear that this approach is not robust, as long as UTC can add
    >> additional names under the slightly different rules of UAX#44.
    >>
    >> That should result in a correction/corrigendum for UTS#18.
    >
    > Sorry, but I'm not seeing it.
    >
    > The conformance requirements for claiming a level of
    > conformance in UTS #18, RL 1.5 Simple Loose Matches,
    > and RL 2.4 Default Loose Matches, have only to do with
    > case-insensitive matching for generic text, and do not
    > involve ignoring of whitespace, hyphens or underscores.
    >
    > The only mention of loose matching of the type that we
    > are talking about is in Section 1.2 Properties, where it is
    > referring specifically to *property* names and values. And
    > there it is couched as a recommendation -- not a conformance
    > requirement:
    >
    > "It is strongly recommended that both [long and short] property
    > names be recognized, and that loose matching of property names
    > be used, whereby the case distinctions, whitespace, hyphens,
    > and underbar are ignored."
    >
    > And as Asmus pointed out in an earlier note in this thread,
    > property names (or more exactly property aliases and
    > property value aliases) follow a different pattern than
    > character names. They are unambiguously interpretable if
    > you ignore all "case distinctions, whitespace, hyphens,
    > and underbar", because there are no funky edge cases
    > involving medial hyphens for those. In fact there are no
    > space characters whatsoever in any of the normative property
    > aliases or property value aliases in the Unicode Character
    > Database. And if somebody sticks a space (or spaces)
    > in a regex expression for something like \p{General Category:Lm}
    > instead of using \p{gc:Lm}, well, then the kindly (and
    > reasonable) thing for the regex engine to do would be
    > to ignore that space, as it is more likely to get the
    > expected result than it would by throwing a syntax exception.
    >
    > The applicable loose matching rule in this case is not
    > the character names loose matching rule (UAX44-LM2), but
    > rather the symbolic values loose matching rule (UAX44-LM3).
    >
    > Now granted this hasn't been spelled out explicitly in
    > the standard all that long -- the elaborations in UAX #44
    > are of fairly recent provenance. But this was nevertheless
    > the clear intent of the property alias files all along,
    > since they first were published as part of the UCD.
    >
    > --Ken
    >
    >
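
    For comparison, the property-name matching Ken describes above can be
    sketched in a few lines. This follows only the wording quoted from
    UTS #18 (ignore case distinctions, whitespace, hyphens, and underbar).
    The names fold_symbolic and resolve_property are invented for the
    example, and the alias table is a tiny hand-written excerpt rather
    than the real property alias data.

```python
def fold_symbolic(value: str) -> str:
    """Ignore case, whitespace, hyphens, and underscores."""
    return "".join(
        ch for ch in value.lower() if not ch.isspace() and ch not in "_-"
    )

# Illustrative excerpt only: short and long alias of one property.
ALIASES = {
    fold_symbolic(alias): "General_Category"
    for alias in ("gc", "General_Category")
}

def resolve_property(user_input: str):
    """Return the canonical property name, or None if nothing matches."""
    return ALIASES.get(fold_symbolic(user_input))

print(resolve_property("General Category"))  # General_Category
print(resolve_property("GC"))                # General_Category
```

    Because none of the aliases in this table differ only by a hyphen or a
    space, stripping them all cannot make two entries collide, which is the
    property Ken points to for the real alias files.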


