Re: property, character, and sequence name loose matching

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Mar 15 2010 - 21:49:31 CST

    On 3/15/2010 8:15 PM, karl williamson wrote:
    > There are a couple of things going on here. Keep in mind that my
    > perspective is that of someone who is trying to implement what Unicode
    > says.
    >
    > First, part of the essence of a medial hyphen is that it not be adjacent
    > to white space. Therefore to determine if a hyphen is medial, it is
    > required to check for adjacent white space. But in the same sentence
    > that Unicode says that hyphens which are medial are to be ignored,
    > Unicode says that white space is also to be ignored. It is impossible
    > to both ignore and not ignore white space. The number of
    > implementations that do what Unicode says here is and will always be
    > zero.

    Fine, you've made your point that

        /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all
        medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

            * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
              "zerowidthspace"
            * "character -a" is /not/ equivalent to "character a"

    could be improved to note the interaction between the presence/absence
    of spaces and "medial". (I believe that's actually in the works).
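
    One way to make the rule implementable is to decide which hyphens are
    medial before any whitespace is removed. The following is only a minimal
    sketch under that assumption: the name loose_key is made up for
    illustration, and "medial" is read as "a hyphen with a letter or digit
    directly on both sides in the string as typed", which is one possible
    interpretation and not wording taken from UAX#44.

```python
import re

# Assumed reading of "medial": a hyphen with a letter or digit immediately
# on both sides, tested on the input before whitespace is stripped.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")
_IGNORED = re.compile(r"[\s_]+")

# The one name whose medial hyphen the rule says must be kept (U+1180).
_KEEP_HYPHEN = "HANGULJUNGSEONGO-E"


def loose_key(name: str) -> str:
    """Return a comparison key; two names match loosely if their keys are equal."""
    upper = name.upper()
    # Key with every hyphen kept, used only to spot the U+1180 exception.
    kept = _IGNORED.sub("", upper)
    if kept == _KEEP_HYPHEN:
        return kept
    # Drop medial hyphens first, then whitespace and underscores, so that a
    # hyphen next to a space (as in "character -a") survives in the key.
    without_medial = _MEDIAL_HYPHEN.sub("", upper)
    return _IGNORED.sub("", without_medial)


print(loose_key("zero-width space") == loose_key("zerowidthspace"))
print(loose_key("character -a") == loose_key("character a"))
```

    With this ordering the two examples under the rule come out as stated:
    the first comparison is true and the second is false.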

    >
    > As an aside, it has been my experience that ignoring all white space
    > usually leads to unintended negative consequences. The 1966 ANSI
    > Fortran standard suffered from this (I don't know about later
    > versions), and it led to problems, with economic consequences. It is
    > a pity that this lesson did not get passed on to later generations. I
    > doubt that Unicode really wants 'S c r i p t' to mean 'Script', but
    > that's what it says. It would have been better in my opinion for it
    > to say that multiple white space is equivalent to a single white space.

    That's a good point, even though you misrepresent the intention of
    Unicode. Of interest here is not so much the folding of multiple spaces
    into one as allowing CamelCase versions of names (instead of UPPER CASE
    or lower case with spaces).

    At the same time, there are some names, esp. character names, where
    users might disagree about where to add spaces. It was felt useful to
    allow the use not only of fewer spaces, but also of more spaces than the
    formal name.

    You've made your point that "s c r i p t" should be discouraged somehow,
    but it's difficult to see how that can be done effectively.
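
    The trade-off between the two whitespace policies can be shown with a
    small sketch. Both function names are hypothetical and neither is taken
    from any Unicode specification; they only contrast "ignore all white
    space" with "treat multiple white space as a single white space".

```python
def ignore_all_whitespace(name: str) -> str:
    """Key that drops every whitespace character and folds case."""
    return "".join(name.split()).lower()


def collapse_whitespace(name: str) -> str:
    """Key that folds each run of whitespace to one space and folds case."""
    return " ".join(name.split()).lower()


# Ignoring all whitespace accepts CamelCase spellings, but it also
# accepts a name with a space between every letter.
print(ignore_all_whitespace("ZeroWidthSpace") == ignore_all_whitespace("zero width space"))
print(ignore_all_whitespace("S c r i p t") == ignore_all_whitespace("Script"))

# Collapsing runs rejects the spaced-out spelling, but it rejects the
# CamelCase spelling as well.
print(collapse_whitespace("S c r i p t") == collapse_whitespace("Script"))
print(collapse_whitespace("ZeroWidthSpace") == collapse_whitespace("zero width space"))
```

    The first two comparisons are true and the last two are false, which is
    why discouraging "s c r i p t" without losing CamelCase is not simple.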

    >
    > But it's probably too late for that, and I haven't thought of all the
    > implications either. Perhaps the simplest thing would be to change
    > the standard to say that white space not adjacent to hyphens is to be
    > ignored.
    See above.
    >
    > Asmus Freytag wrote:
    >> On 3/11/2010 10:12 PM, karl williamson wrote:
    >>> Andrew West wrote:
    >>>> On 11 March 2010 20:32, karl williamson <public@khwilliamson.com>
    >>>> wrote:
    >>>>> I think it is actually better to do the following:
    >>>>> 1. Remove all white space
    >>>>> 2. Collapse multiple hyphens in a row into one
    >>>>> 3. Lowercase
    >>>>> 4. If the result is one of the three problematic ones, we are done.
    >>>>> 5. Remove all hyphens
    >>>>>
    >>>>> Then, if the strings are the same after the transforms, they match.
    >>>>
    >>>> No, then "TIBETAN MARK TSA PHRU" would match "TIBETAN MARK TSA -PHRU",
    >>>> which may be what the user intended, but it is not what they asked
    >>>> for, and would be as bad as matching e.g. "PERCENT IGN" and "PERCENT
    >>>> SIGN".
    >
    > This is a false analogy because Unicode has never said that 'S' is to be
    > ignored in loose matching. Unicode still says (in TR18) that all
    > hyphens (except in 3 cases) are to be ignored. If hyphens can be
    > significant parts of character names, Unicode should never have said
    > they effectively aren't.

    UTS#18 is formally a different standard than the Unicode Standard (TUS)
    (which incorporates UAX#44).
    In this case, you are correct: UTS#18 is in conflict with UAX#44 (and
    therefore TUS). The three cases may have been the only cases where
    hyphens resulted in a distinct name at the time UTS#18 was drafted, but
    it's clear that this approach is not robust, as long as UTC can add
    additional names under the slightly different rules of UAX#44.

    That should result in a correction/corrigendum for UTS#18.
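
    For reference, the five-step procedure quoted above can be sketched as
    follows, together with the collision Andrew pointed out. This is only
    an illustration: the function name is made up, and the exception table
    is a placeholder holding just the one hyphenated name quoted in this
    thread, not the full list of three that TR18 singles out.

```python
import re

# Placeholder table: only the hyphenated name quoted in this thread, in the
# form it has after steps 1-3. A real table would list all three exceptions.
PROBLEMATIC = frozenset({"hanguljungseongo-e"})


def five_step_key(name: str, problematic=PROBLEMATIC) -> str:
    s = re.sub(r"\s+", "", name)   # 1. remove all white space
    s = re.sub(r"-{2,}", "-", s)   # 2. collapse multiple hyphens into one
    s = s.lower()                  # 3. lowercase
    if s in problematic:           # 4. a listed exception keeps its hyphen
        return s
    return s.replace("-", "")      # 5. remove all remaining hyphens


# The two Tibetan spellings end up with the same key, because the
# hyphen in "-PHRU" is not in the exception table.
print(five_step_key("TIBETAN MARK TSA PHRU") == five_step_key("TIBETAN MARK TSA -PHRU"))
```

    The comparison prints True, which is exactly the non-robustness at
    issue: any name that is distinct only by a hyphen and is not already in
    the fixed table is silently merged.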
    >
    >>>>
    >>>> Andrew
    >>>>
    >>>
    >>> OK, but that is a change from what TR18 says: "names should use a
    >>> loose match, disregarding case, spaces and hyphen" except for the
    >>> three problematic situations it mentions. There is no character
    >>> TIBETAN MARK TSA PHRU,
    >> But it's a name that could be added to the standard at any moment,
    >> because it would be formally distinct from any existing
    >>
    >> TIBETAN MARK TSA -PHRU
    >
    > I find this statement very disconcerting, because it means that I
    > cannot trust what Unicode says. TR18 for the last almost 7 years and
    > 4 or so versions has said that all hyphens (except for the 3 cases)
    > can be ignored. Now you're saying that Unicode feels free to add more
    > such cases, thus causing implementations that relied on Unicode's word
    > to fail. The failure will probably be subtle, so it won't be
    > immediately apparent.
    TUS and UTS#18 are each standalone specifications. They are created by
    the same organization, but their content agrees only to the extent that
    the editors ensure that the statements in one are compatible with the
    other. You've found a case where a bug in the specification in UTS#18
    has been overlooked and you've made your point.

    In this case, the bug rests with UTS#18, and not the other way around.
    >
    > Yes it's true that backward compatibility cannot always be guaranteed;
    > but it should always be a goal, and the reasons for breaking it should
    > be compelling.
    >
    > Unicode could choose names that don't violate TR18. Choosing ones
    > that do shows disrespect to your customers, in my opinion.
    Character names are chosen by UTC and WG2. The latter is not involved in
    creating or maintaining UTS#18.
    There is no issue for property names, because there is no exception for
    medial hyphens for those properties that are not character names.
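
    Since there is no hyphen exception outside character names, matching
    for property names can be sketched much more simply. The function name
    below is hypothetical, and the sketch assumes only what is said here:
    case, white space, underscores and all hyphens are ignored.

```python
import re


def property_key(name: str) -> str:
    """Key for loose matching of property names: no hyphen is significant."""
    # Drop whitespace, underscores and every hyphen, then fold case.
    return re.sub(r"[\s_-]+", "", name).lower()


print(property_key("Line_Break") == property_key("line-break"))
```

    Because no hyphen is ever kept, the order in which spaces and hyphens
    are removed does not matter for this case.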
    >
    > That said, I can also say that Perl 5 has not implemented loose
    > matching for character names, so will not be affected by any immediate
    > changes to it. I also know that no one has strictly implemented
    > Unicode's definition of loose matching because it is impossible to do
    > so. But I don't know what any implementations actually have done.
    >>
    >> so you can't simply match according to what might be intended,
    >> because then, if such a character is later added, everything fails.
    >>> and I thought the whole point of loose matching is to follow the
    >>> intent of the user even in the face of certain missing or extraneous
    >>> punctuation and spacing characters, so even though it is not exactly
    >>> what they asked for, it is close enough by the traditional definition.
    >>>
    >>> I realize that TR18 is not an official part of the standard, and
    >>> that TR44 is now UAX44, so is. Therefore, this is a change in the
    >>> standard that I don't believe was listed as a delta.
    >>>
    >>>
    >>
    >
    >
    >


