Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Mon Mar 15 2010 - 21:15:00 CST

    There are a couple of things going on here. Keep in mind that my
    perspective is that of someone who is trying to implement what Unicode says.

    First, part of the essence of a medial hyphen is that it is not adjacent
    to white space. Therefore, to determine whether a hyphen is medial, one
    must check for adjacent white space. But in the same sentence
    in which Unicode says that medial hyphens are to be ignored,
    it says that white space is also to be ignored. It is impossible
    to both ignore and not ignore white space. The number of
    implementations that do what Unicode says here is and will always be zero.
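
    The ordering problem can be illustrated with a small Python sketch.
    The two helper names are hypothetical, and the regular expression is
    only one reading of "medial" (a hyphen that is neither at an end of the
    name nor next to white space); it is not taken from any Unicode document.

```python
import re


def strip_whitespace(name: str) -> str:
    """Remove every white-space character from the name."""
    return re.sub(r"\s+", "", name)


def has_nonmedial_hyphen(name: str) -> bool:
    """Return True if some hyphen sits at an end of the name or next to white space.

    Assumption: this is an illustrative definition of "non-medial".
    """
    return re.search(r"(^|\s)-|-(\s|$)", name) is not None


original = "TIBETAN MARK TSA -PHRU"

# Checked before white space is removed, the hyphen is seen as non-medial.
before = has_nonmedial_hyphen(original)

# Checked after white space is removed, the same hyphen looks medial.
after = has_nonmedial_hyphen(strip_whitespace(original))

print(before, after)
```

    Once white space has been stripped, the leading hyphen of -PHRU can no
    longer be told apart from a medial one, so the check only works on the
    unstripped name.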

    As an aside, it has been my experience that ignoring all white space
    usually leads to unintended negative consequences. The 1966 ANSI
    Fortran standard suffered from this (I don't know about later versions),
    and it led to problems, with economic consequences. It is a pity that
    this lesson did not get passed on to later generations. I doubt that
    Unicode really wants 'S c r i p t' to mean 'Script', but that's what it
    says. It would have been better, in my opinion, for it to say that
    a run of white space is equivalent to a single white-space character.

    But it's probably too late for that, and I haven't thought of all the
    implications either. Perhaps the simplest thing would be to change the
    standard to say that white space not adjacent to hyphens is to be ignored.
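
    A minimal Python sketch of that last suggestion follows. It assumes that
    runs of white space are first collapsed to a single space and that a
    space touching a hyphen on either side is kept; the function name is
    hypothetical, and this is one possible reading of the idea rather than
    a worked-out rule.

```python
import re


def ignore_space_away_from_hyphens(name: str) -> str:
    """Drop white space unless it is directly next to a hyphen.

    Assumption: runs of white space are collapsed to one space first,
    so "adjacent to a hyphen" means a single space touching '-'.
    """
    collapsed = re.sub(r"\s+", " ", name)
    # Remove a space only when neither neighbour is a hyphen.
    return re.sub(r"(?<!-) (?!-)", "", collapsed)


print(ignore_space_away_from_hyphens("TIBETAN MARK TSA -PHRU"))
print(ignore_space_away_from_hyphens("TIBETAN MARK TSA PHRU"))
```

    Under this reading the space in front of -PHRU survives, so the two
    names above no longer reduce to the same string, while 'S c r i p t'
    would still reduce to 'Script'.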

    Asmus Freytag wrote:
    > On 3/11/2010 10:12 PM, karl williamson wrote:
    >> Andrew West wrote:
    >>> On 11 March 2010 20:32, karl williamson <public@khwilliamson.com> wrote:
    >>>> I think it is actually better to do the following:
    >>>> 1. Remove all white space
    >>>> 2. Collapse multiple hyphens in a row into one
    >>>> 3. Lowercase
    >>>> 4. If the result is one of the three problematic ones, we are done.
    >>>> 5. Remove all hyphens
    >>>>
    >>>> Then, if the strings are the same after the transforms, they match.
    >>>
    >>> No, then "TIBETAN MARK TSA PHRU" would match "TIBETAN MARK TSA -PHRU",
    >>> which may be what the user intended, but it is not what they asked
    >>> for, and would be as bad as matching e.g. "PERCENT IGN" and "PERCENT
    >>> SIGN".

    This is a false analogy because Unicode has never said that 'S' is to be
    ignored in loose matching. Unicode still says (in TR18) that all
    hyphens (except in 3 cases) are to be ignored. If hyphens can be
    significant parts of character names, Unicode should never have said
    they effectively aren't.
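
    For reference, the five steps quoted above can be sketched in Python as
    follows. The function names are hypothetical, and the contents of the
    exception set are an assumption: they reflect my understanding of the
    three names TR18 singles out, which this thread does not spell out.

```python
import re

# Assumption: the three hyphenated names TR18 treats as exceptions,
# already reduced by steps 1-3 (no white space, lowercase).
EXCEPTIONS = frozenset({
    "tibetanletter-a",
    "tibetansubjoinedletter-a",
    "hanguljungseongo-e",
})


def loose_key(name: str, exceptions: frozenset = EXCEPTIONS) -> str:
    """Reduce a character name to its comparison key."""
    key = re.sub(r"\s+", "", name)       # 1. remove all white space
    key = re.sub(r"-{2,}", "-", key)     # 2. collapse runs of hyphens
    key = key.lower()                    # 3. lowercase
    if key in exceptions:                # 4. problematic names keep their hyphen
        return key
    return key.replace("-", "")          # 5. remove all remaining hyphens


def loose_match(a: str, b: str) -> bool:
    """Two names match if their keys are identical."""
    return loose_key(a) == loose_key(b)


print(loose_match("TIBETAN MARK TSA PHRU", "TIBETAN MARK TSA -PHRU"))
print(loose_match("TIBETAN LETTER A", "TIBETAN LETTER -A"))
print(loose_match("PERCENT IGN", "PERCENT SIGN"))
```

    With these steps the two TIBETAN MARK spellings compare equal, which is
    the behavior Andrew objects to, while the listed exceptions stay
    distinct and a missing letter never matches.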

    >>>
    >>> Andrew
    >>>
    >>
    >> OK, but that is a change from what TR18 says: "names should use a
    >> loose match, disregarding case, spaces and hyphen" except for the
    >> three problematic situations it mentions. There is no character
    >> TIBETAN MARK TSA PHRU,
    > But it's a name that could be added to the standard at any moment,
    > because it would be formally distinct from any existing
    >
    > TIBETAN MARK TSA -PHRU

    I find this statement very disconcerting, because it means that I cannot
    trust what Unicode says. For almost 7 years and 4 or so versions, TR18
    has said that all hyphens (except for the 3 cases) can be
    ignored. Now you're saying that Unicode feels free to add more such
    cases, thus causing implementations that relied on Unicode's word to
    fail. The failure will probably be subtle, so it won't be immediately
    apparent.

    Yes, it's true that backward compatibility cannot always be guaranteed,
    but it should always be a goal, and the reasons for breaking it should
    be compelling.

    Unicode could choose names that don't violate TR18. Choosing ones that
    do shows disrespect to your customers, in my opinion.

    That said, I can also say that Perl 5 has not implemented loose matching
    for character names, so it will not be affected by any immediate changes to
    it. I also know that no one has strictly implemented Unicode's
    definition of loose matching because it is impossible to do so. But I
    don't know what any implementations actually have done.
    >
    > so you can't simply match according to what might be intended, because
    > then, if such a character is later added, everything fails.
    >> and I thought the whole point of loose matching is to follow the
    >> intent of the user even in the face of certain missing or extraneous
    >> punctuation and spacing characters, so even though it is not exactly
    >> what they asked for, it is close enough by the traditional definition.
    >>
    >> I realize that TR18 is not an official part of the standard, and that
    >> TR44 is now UAX44, so it is. Therefore, this is a change in the
    >> standard that I don't believe was listed as a delta.
    >>
    >>
    >


