Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Sat Mar 20 2010 - 22:40:09 CST


    karl williamson wrote:
    > Kenneth Whistler wrote:
    > > Asmus (responding to Karl Williamson) noted:
    > >
    > >> Fine, you've made your point that
    >
    > I'm sorry if I belabored my points; I didn't realize that they had sunk
    > in, so was restating in simpler terms.
    >
    > >>
    > >> /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all
    > >> medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    > >>
    > >> * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
    > >> "zerowidthspace"
    > >> * "character -a" is /not/ equivalent to "character a"
    > >>
    > >> could be improved to note the interaction between the presence/absence
    > >> of spaces and "medial". (I believe that's actually in the works).
    > >
    > > Indeed:
    > >
    > > http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
    > >
    > > for the Unicode 6.0 proposed update draft. Now is everybody's
    > > chance to comment if anything about that clarification is still
    > > problematical.
    >
    > It still seems wrong to me.
    >
    > First, a question. What does "whitespace" mean? Is it more than the
    > SPACE character?
    >
    > I see two possible directions to go in implementing this to best get the
    > user's intent.
    >
    > One is to say that input white space is significant adjacent to hyphens,
    > and to ignore input medial hyphens. But, this doesn't get the best
    > results. For example, if the user inputs \N{TIBETAN LETTER-A} (probably
    > meaning U+0F60, \N{TIBETAN LETTER -A}), it would get parsed instead as
    > \N{TIBETAN LETTER A} (= U+0F68).
    >
    > The other way is to keep a list of the problematic code points to handle
    > specially, of which the above is one (the implementation could either
    > consider this an error as being ambiguous, or interpret it as U+0F60).
    > This is the approach advocated in TR18, and as far as I know, its
    > current list of three is correct for TUS 5.2 and earlier. The problem
    > with this approach, as Asmus has pointed out, is that the list can
    > change in future Unicode versions. As an implementor, if I know that
    > the list can change, I can write code that recalculates the list for
    > each new Unicode release. I have no problem with that. Or if Unicode
    > were good about listing all the gotcha changes for a new release, I
    > could keep the list by hand, and add to it when necessary.
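
    Recalculating the list for each release, as described above, could be
    sketched roughly as follows. This is only an illustration, not the code
    of any real implementation: it assumes Python's standard unicodedata
    module, so it reflects whatever Unicode version that Python build ships
    with, and the function name hyphen_collisions is made up for this sketch.
    It groups the official names by what is left once every space, hyphen,
    and underscore is dropped, and reports the groups with more than one
    member.

```python
import unicodedata
from collections import defaultdict


def hyphen_collisions():
    """Group official character names that become identical once all
    spaces, hyphens, and underscores are ignored.

    Hypothetical helper for illustration; the result depends on the
    Unicode version bundled with the running Python interpreter.
    """
    groups = defaultdict(list)
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is None:
            continue  # unassigned, control, surrogate, private use, ...
        key = name.replace(" ", "").replace("-", "").replace("_", "")
        groups[key].append((cp, name))
    # Only keys shared by two or more code points are problematic.
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    for key, members in sorted(hyphen_collisions().items()):
        print(key, ["U+%04X %s" % member for member in members])
```

    Running this against each new release and comparing the output with the
    previous release's would show whether the list has grown.
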
    >
    > It seems to me that the new proposed wording of UAX44 could be saying
    > kind of the reverse of this second approach, as it uses not the
    > problematic cases, but the normal. That is, it defines a medial hyphen
    > as one occurring in the official name, thus the implementation is
    > supposed to know which are those (a much longer list than the
    > problematic ones). It specifically says that hyphens that become medial
    > as a result of removing white space in transforming the input are not to
    > be considered medial. But it doesn't mention hyphens that were input as
    > medial but aren't in the official name. If I take the text literally,
    > as one is supposed to in a standard, then these are to be considered
    > significant, as they don't fit its definition of medial, and hence will
    > cause the input name to not match anything. Thus U+112C HANGUL CHOSEONG
    > KAP-YEOUN-SSANG-PIEUP (I'm guessing at the syllable boundaries) would
    > not match any official name.
    >
    > I think the best solution is to go back to the TR18 approach for the
    > whole standard, UAX44 and Chapter 4, but explicitly say that the list is
    > subject to addition.

    I have thought about this some more, and realized it doesn't work,
    because working code could suddenly stop working when a new character
    gets added to a later version of Unicode. For example, if \N{TIBETAN
    LETTER A} had been defined in an earlier release than
    \N{TIBETAN LETTER -A}, with the method I proposed, one could have said
    \N{TIBETAN LETTER-A} in that first release unambiguously, but when the
    new release came out, that same code would match the other character.

    The bottom line is that for a code point that has a normative name with
    a non-medial hyphen, the hyphen must be input as non-medial for the name
    to be properly parsed to the correct code point. In other words, there
    must be a space adjacent to such a hyphen (or the hyphen must be at the
    beginning or end of the string, if those become legal). Perhaps that was what was
    intended in the proposed UAX44 language, but it doesn't convey that to me.
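
    To make the bottom line concrete, here is a minimal sketch of a matcher
    that decides whether a hyphen is medial from the input as typed, before
    any white space is removed. It is not taken from UAX44 or TR18. The
    names loose_key, build_index, and lookup are made up for this sketch,
    treating underscore like a space when judging adjacency is my own
    assumption, and the U+1180 exception is handled by a simple suffix test.
    It uses Python's standard unicodedata module, so the names are those of
    whatever Unicode version the interpreter bundles.

```python
import re
import unicodedata


def loose_key(name: str) -> str:
    """Fold a character name for loose comparison (illustrative only).

    A hyphen is dropped only when it is medial in the text as given,
    i.e. it has a letter or digit on both sides before white space is
    removed. A hyphen next to a space (or at either end) is kept.
    """
    s = name.upper().replace("_", " ")  # assumption: underscore acts like a space
    keeps_o_e = s.rstrip().endswith("O-E")
    s = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", s)  # drop medial hyphens
    s = re.sub(r"\s+", "", s)                          # then drop white space
    # The one medial hyphen that stays significant: U+1180 HANGUL JUNGSEONG O-E.
    if keeps_o_e and s == "HANGULJUNGSEONGOE":
        return "HANGULJUNGSEONGO-E"
    return s


def build_index():
    """Map the folded form of every official name to its code point."""
    index = {}
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            index[loose_key(name)] = cp
    return index


INDEX = build_index()


def lookup(name: str):
    """Return the code point whose official name loosely matches, or None."""
    return INDEX.get(loose_key(name))
```

    With this, lookup("tibetan letter -a") finds U+0F60, while
    lookup("TIBETAN LETTER-A") finds U+0F68, because in the second input the
    hyphen has a letter on both sides and is therefore dropped.
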
    > >
    > >>> As an aside, it has been my experience that ignoring all white space
    > >>> usually leads to unintended negative consequences. The 1966 ANSI
    > >>> Fortran standard suffered from this (I don't know about later
    > >>> versions), and it led to problems, with economic consequences. It is
    > >>> a pity that this lesson did not get passed on to later generations. I
    > >>> doubt that Unicode really wants 'S c r i p t' to mean 'Script', but
    > >>> that's what it says. It would have been better in my opinion for it
    > >>> to say that multiple white space is equivalent to a single white
    > >>> space.
    > >> That's a good point, even though you misrepresent the intention of
    > >> Unicode. Of interest here is not the folding of multiple spaces into one
    > >> as much as allowing CamelCase version of names (instead of UPPER CASE or
    > >> lower case with spaces).
    > >>
    > >> At the same time, there are some names, esp. character names, where
    > >> users might disagree about where to add spaces. It was felt useful to
    > >> allow the use not only of fewer spaces, but also of more spaces than the
    > >> formal name.
    > >
    > > There are examples such as U+003C LESS-THAN SIGN, where one wouldn't
    > > want what might be a fairly common spelling for a match,
    > > "less than sign" not to match the formal name "LESS-THAN SIGN".
    > >
    > > In Hangul jamo letter names like U+112C HANGUL CHOSEONG
    > > KAPYEOUNSSANGPIEUP, the last part is actually four syllables,
    > > KAP YEOUN SSANG PIEUP, and you might not know where somebody
    > > would or would not add spaces -- or hyphens, for that matter.
    > >
    > > No one would *really* know where they should put hyphens or
    > > spaces in U+238F OPEN-CIRCUIT-OUTPUT H-TYPE SYMBOL without looking
    > > it up in the names list. ;-)
    > >
    > > U+269C FLEUR-DE-LIS uses the *English* spelling which in most
    > > dictionaries shows hyphens, but a French speaker would be more
    > > likely to use "fleur de lis" without hyphens, since that is
    > > the French spelling.
    > >
    > > The point of a loose matching rule for character names like
    > > this is to capture reasonable expectations about what people
    > > might want to do in contexts like identifier, label, or
    > > presentation, and still successfully match the
    > > intended character.
    > >
    > > And the standardization committees (UTC and WG2) are aware of
    > > the loose matching rule for character names, and check against
    > > it when creating new character names, so as not to introduce
    > > character names that would be ambiguous under that loose
    > > matching rule.
    > >
    > >>> This is a false analogy because Unicode has never said that 'S' is to be
    > >>> ignored in loose matching. Unicode still says (in TR18) that all
    > >>> hyphens (except in 3 cases) are to be ignored. If hyphens can be
    > >>> significant parts of character names, Unicode should never have said
    > >>> they effectively aren't.
    > >> UTS 18 is formally a different standard than the Unicode Standard (TUS)
    > >> (which incorporates UAX#44).
    > >> In this case, you are correct: UTS#18 is in conflict with UAX#44 (and
    > >> therefore TUS). The three cases may have been the only cases where
    > >> hyphens resulted in a distinct name at the time UTS#18 was drafted, but
    > >> it's clear that this approach is not robust, as long as UTC can add
    > >> additional names under the slightly different rules of UAX#44.
    > >>
    > >> That should result in a correction/corrigendum for UTS#18.
    > >
    > > Sorry, but I'm not seeing it.
    > >
    > > The conformance requirements for claiming a level of
    > > conformance in UTS #18, RL 1.5 Simple Loose Matches,
    > > and RL 2.4 Default Loose Matches, have only to do with
    > > case-insensitive matching for generic text, and do not
    > > involve ignoring of whitespace, hyphens or underscores.
    > >
    > > The only mention of loose matching of the type that we
    > > are talking about is in Section 1.2 Properties, where it is
    > > referring specifically to *property* names and values. And
    > > there it is couched as a recommendation -- not a conformance
    > > requirement:
    > >
    > > "It is strongly recommended that both [long and short] property
    > > names be recognized, and that loose matching of property names
    > > be used, whereby the case distinctions, whitespace, hyphens,
    > > and underbar are ignored."
    > >
    > > And as Asmus pointed out in an earlier note in this thread,
    > > property names (or more exactly property aliases and
    > > property value aliases) follow a different pattern than
    > > character names. They are unambiguously interpretable if
    > > you ignore all "case distinctions, whitespace, hyphens,
    > > and underbar", because there are no funky edge cases
    > > involving medial hyphens for those. In fact there are no
    > > space characters whatsoever in any of the normative property
    > > aliases or property value aliases in the Unicode Character
    > > Database. And if somebody sticks a space (or spaces)
    > > in a regex expression for something like \p{General Category:Lm}
    > > instead of using \p{gc:Lm}, well, then the kindly (and
    > > reasonable) thing for the regex engine to do would be
    > > to ignore that space, as it is more likely to get the
    > > expected result than it would by throwing a syntax exception.
    > >
    > > The applicable loose matching rule in this case is not
    > > the character names loose matching rule (UAX44-LM2), but
    > > rather the symbolic values loose matching rule (UAX44-LM3).
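
    For contrast with character names, the looser matching recommended for
    property names in the TR18 passage quoted above (ignore case, whitespace,
    hyphens, and underbar) might be sketched as below. This is a sketch
    under stated assumptions, not anyone's actual implementation: the
    function names are made up, the alias table is a tiny hand-copied
    excerpt (gc / General_Category and sc / Script) rather than the real
    PropertyAliases.txt, and anything else UAX44-LM3 may specify beyond
    those four things is left out.

```python
import re


def loose_symbol(s: str) -> str:
    """Fold a property name: ignore case, whitespace, hyphens, underscores."""
    return re.sub(r"[\s_\-]+", "", s).casefold()


# Tiny illustrative excerpt; a real implementation would load the full
# alias files from the Unicode Character Database.
PROPERTY_ALIASES = {
    "gc": "General_Category",
    "General_Category": "General_Category",
    "sc": "Script",
    "Script": "Script",
}

ALIAS_INDEX = {loose_symbol(alias): prop for alias, prop in PROPERTY_ALIASES.items()}


def resolve_property(name: str):
    """Return the long property name for a loosely spelled alias, or None."""
    return ALIAS_INDEX.get(loose_symbol(name))
```

    Here resolve_property("General Category") and resolve_property("GC")
    both give "General_Category", and 'S c r i p t' resolves to "Script",
    since no alias depends on a space or hyphen to stay distinct.
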
    > >
    > > Now granted this hasn't been spelled out explicitly in
    > > the standard all that long -- the elaborations in UAX #44
    > > are of fairly recent provenance. But this was nevertheless
    > > the clear intent of the property alias files all along,
    > > since they first were published as part of the UCD.
    > >
    > > --Ken


