Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Thu Mar 11 2010 - 14:32:40 CST

    Asmus Freytag wrote:
    > On 3/11/2010 11:45 AM, karl williamson wrote:
    >> Mark Davis ☕ wrote:
    >>> I agree that the wording should be clearer. What is meant by
    >>>
    >>> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    >>> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >>>
    >>>
    >>> is that when matching two strings, transform each in the following way.
    >>>
    >>> 1. remove all hyphens that are medial (except in U+1180) then
    >>> 2. remove whitespace and underscore, and lowercase.
    >>>
    >>> If after these transforms, the two strings are the same, then they
    >>> match.
    >>>
    >>> This is a logical statement: you can do the transformations in a
    >>> single pass if you are careful, and you also can do the comparison
    >>> while transforming incrementally.
    >>>
    >>> Mark
    >>>
    >>
    >> Ok. Thank you. That's totally clear and implementable. I just want
    >> to be sure that you realize that this means that if the user writes
    >> TIBETAN LETTER-A
    >>
    >> the rules above yield
    >> tibetanlettera
    > Correct, the hyphen, being medial, is removed.
    >>
    >> which maps to
    >> TIBETAN LETTER A
    >>
    >> and not to what they more likely meant
    >> TIBETAN LETTER -A
    > LETTER-A is indeed the same as LETTER A
    >
    > If you want LETTER -A you need to retain the hyphen, and at least one space
    >
    > L E T T E R -A
    >
    > would match
    >>
    >> So therefore in this (and in TIBETAN SUBJOINED LETTER -A) the white
    >> space before the '-' is significant, and that isn't mentioned in the
    >> documents, except tr18.
    > Correct, an example to that effect in UAX#44 would help clarify the
    > impact of the word "medial" in the rules.
    >
    > A./
    >>
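
    For reference, here is a rough Python sketch of the two-step transform
    as Mark describes it above. It is only an illustration under stated
    assumptions: the names lm2_key and squeeze are made up for this
    message, "medial" is taken to mean a hyphen with a letter or digit
    directly on both sides, and the U+1180 exception is detected by
    comparing against its squeezed, lowercased name.

```python
import re

# Assumption: the U+1180 exception is recognised by its squeezed,
# lowercased name (HANGUL JUNGSEONG O-E).
HANGUL_O_E = "hanguljungseongo-e"


def squeeze(s: str) -> str:
    """Step 2: remove whitespace and underscores, then lowercase."""
    return re.sub(r"[\s_]+", "", s).lower()


def lm2_key(name: str) -> str:
    """Hypothetical helper: loose-matching key per the two steps quoted above."""
    # Exception: keep the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    if squeeze(name) == HANGUL_O_E:
        return HANGUL_O_E
    # Step 1: remove medial hyphens. Assumption: "medial" means a hyphen
    # with a letter or digit immediately on both sides.
    no_medial = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    # Step 2: remove whitespace and underscore, and lowercase.
    return squeeze(no_medial)


def lm2_match(a: str, b: str) -> bool:
    """Two strings match if their transformed forms are identical."""
    return lm2_key(a) == lm2_key(b)
```

    With this sketch, "TIBETAN LETTER-A" produces the same key as
    "TIBETAN LETTER A", while "TIBETAN LETTER -A" keeps its hyphen because
    the preceding space makes it non-medial.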

    I think it is actually better to do the following:
    1. Remove all white space
    2. Collapse multiple hyphens in a row into one
    3. Lowercase
    4. If the result is one of the three problematic names, stop here and
       keep its hyphen.
    5. Otherwise, remove all hyphens

    Then, if the strings are the same after the transforms, they match.
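
    A minimal Python sketch of the five steps above, not a definitive
    implementation. The function names and the PROBLEMATIC set are
    assumptions for illustration: the set holds the three hyphenated names
    mentioned in this thread, written as they look after steps 1-3.
    Underscores are left alone because the steps above do not mention them.

```python
import re

# Assumption: the "three problematic ones" are the names discussed in this
# thread, shown in the form they take after steps 1-3.
PROBLEMATIC = {
    "tibetanletter-a",           # TIBETAN LETTER -A
    "tibetansubjoinedletter-a",  # TIBETAN SUBJOINED LETTER -A
    "hanguljungseongo-e",        # HANGUL JUNGSEONG O-E
}


def loose_key(name: str) -> str:
    """Hypothetical helper: apply the five steps and return the key."""
    s = re.sub(r"\s+", "", name)   # 1. remove all white space
    s = re.sub(r"-{2,}", "-", s)   # 2. collapse runs of hyphens into one
    s = s.lower()                  # 3. lowercase
    if s in PROBLEMATIC:           # 4. problematic name: done, keep hyphen
        return s
    return s.replace("-", "")      # 5. remove all hyphens


def loose_match(a: str, b: str) -> bool:
    """Two strings match if they are the same after the transforms."""
    return loose_key(a) == loose_key(b)
```

    Under this sketch, "TIBETAN LETTER-A" matches "TIBETAN LETTER -A"
    rather than "TIBETAN LETTER A", so the white space before the '-' is
    no longer significant.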


