Re: property, character, and sequence name loose matching

From: Asmus Freytag (
Date: Thu Mar 11 2010 - 12:43:21 CST

    On 3/11/2010 9:34 AM, karl williamson wrote:
    > Implementers need highly precise wording in a standard. So this
    > sentence in the current UAX44 draft (thanks for the link) is
    > problematic for me:
    > UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    > hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    > If whitespace is ignored, then all hyphens are medial, and as tr18
    > points out, there would then be two other confusable cases, involving
    > what you might think of as "initial" hyphens.
    No, you have to construct this differently.

    Initially, you have the formal names that contain words (no spaces
    inside a word) and have a well-defined concept of medial hyphens.

    You can construct other identifiers from that, by the matching rules, as
    long as you can match your new identifier to the existing formal name.

    One requirement appears to be that you can't add spaces in a way that
    removes the distinction of a medial hyphen.

    So, when you have a formal name with an A-B, you can create a name that
    is "AB" but not "A- B" or "A -B". The latter two appear to have
    non-medial hyphens, which may not be ignored in matching.

    How you can capture that in a regex, I don't know.

    > So, I'm in a hurry. I don't have time to wait for the next draft of
    > UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
    > intended, it would be good if I knew immediately, so I could go and
    > plead that the revisions I would have to write be allowed in so that
    > the defective version would never get published.
    > My sense, though, is that I didn't misread it, that the statements
    > made in UAX34 and 44 are imprecise, and based on your responses to
    > this email, I will submit an official report through your website.
    The problem is that the rules don't define a regular expression that
    spans a set of strings in a way that is independent of the formal name.
    They only tell you, once you have a formal name, how to match other
    formulations of that name to it.

    But whether or not a hyphen is medial is defined by the formal name,
    and deciding whether the exception for U+1180 applies requires the
    context of the full name.
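
    As a rough illustration of matching against formal names rather than
    against a free-standing pattern, here is a minimal Python sketch. It is
    one possible reading of the description above, not the UAX44 algorithm:
    the helper names (loose_key, lookup) are hypothetical, treating
    underscores like whitespace is an assumption, and the table holds only
    a handful of formal names for demonstration.

```python
import re

# Assumption: underscores behave like whitespace for matching purposes.
_IGNORABLE_RUN = re.compile(r"[\s_]+")
# A hyphen is treated as medial only when a letter or digit sits
# directly on each side of it (no space in between).
_MEDIAL_HYPHEN = re.compile(r"(?<=[a-z0-9])-(?=[a-z0-9])")
# A space that does not touch a hyphen on either side.
_PLAIN_SPACE = re.compile(r"(?<!-) (?!-)")


def loose_key(name: str) -> str:
    """Reduce a formal name or a candidate spelling to a comparison key."""
    # Ignore case; collapse whitespace/underscore runs to single spaces.
    text = _IGNORABLE_RUN.sub(" ", name.lower()).strip()
    # Drop medial hyphens.
    no_medial = _MEDIAL_HYPHEN.sub("", text)
    # Drop spaces, except those adjacent to a surviving (non-medial)
    # hyphen, so "A- B" and "A -B" stay distinct from "AB" and each other.
    key = _PLAIN_SPACE.sub("", no_medial)
    # Exception: the hyphen in U+1180 HANGUL JUNGSEONG O-E is kept, which
    # can only be decided by looking at the whole name.
    if key == "hanguljungseongoe" and text.endswith("o-e"):
        return "hanguljungseongo-e"
    return key


# A few formal names, enough to show the confusable cases.
FORMAL_NAMES = {
    0x0F39: "TIBETAN MARK TSA -PHRU",
    0x0F60: "TIBETAN LETTER -A",
    0x0F68: "TIBETAN LETTER A",
    0x116C: "HANGUL JUNGSEONG OE",
    0x1180: "HANGUL JUNGSEONG O-E",
    0x2011: "NON-BREAKING HYPHEN",
}

# The index is built from the formal names; candidates are matched to it.
INDEX = {loose_key(name): cp for cp, name in FORMAL_NAMES.items()}


def lookup(candidate: str):
    """Return the code point whose formal name loosely matches, or None."""
    return INDEX.get(loose_key(candidate))
```

    Under these assumptions, lookup("non_breaking hyphen") and
    lookup("NonBreaking-Hyphen") both find U+2011, while
    lookup("non- breaking hyphen") finds nothing, because the hyphen next
    to a space is kept in the key.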

