Re: property, character, and sequence name loose matching

From: Mark Davis ☕ (mark@macchiato.com)
Date: Thu Mar 11 2010 - 12:34:16 CST

    I agree that the wording should be clearer. What is meant by

    UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens
    except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

    is that when matching two strings, transform each in the following way.

       1. remove all hyphens that are medial (except in U+1180) then
       2. remove whitespace and underscore, and lowercase.

    If after these transforms, the two strings are the same, then they match.
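The two steps above can be sketched in Python as follows. This is a minimal illustration, not text from UAX #44. The names `loose_key` and `loose_match` are hypothetical. It assumes that "medial" means a run of hyphen-minus characters with an ASCII letter or digit directly on each side, an assumption chosen so that "CHARACTER BZ--ZT" matches "CHARACTER BZZT" as Ken describes further down the thread. The U+1180 exception is handled by a simple whole-name check.

```python
import re

# Assumption: a "medial" hyphen is a run of '-' with an ASCII letter or
# digit immediately before and immediately after it.
_MEDIAL_HYPHENS = re.compile(r"(?<=[A-Za-z0-9])-+(?=[A-Za-z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")

# U+1180 HANGUL JUNGSEONG O-E with only whitespace, underscore, and case folded.
_U1180_KEY = "hanguljungseongo-e"


def loose_key(name: str) -> str:
    """Return the transformed form of a character name for loose comparison."""
    # Simplified exception: keep the hyphen only when the whole name is
    # U+1180 written with its hyphen (and no other hyphens).
    squeezed = _SPACE_OR_UNDERSCORE.sub("", name).lower()
    if squeezed == _U1180_KEY:
        return squeezed
    # Step 1: remove medial hyphens while whitespace is still present.
    without_medial = _MEDIAL_HYPHENS.sub("", name)
    # Step 2: remove whitespace and underscore, then lowercase.
    return _SPACE_OR_UNDERSCORE.sub("", without_medial).lower()


def loose_match(a: str, b: str) -> bool:
    """Two names match if their transformed forms are identical."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    print(loose_match("CHARACTER BZZT", "character_b-zz-t"))
    print(loose_match("HANGUL JUNGSEONG O-E", "HANGUL JUNGSEONG OE"))
```

The order of the steps matters in this sketch: a hyphen that sits next to a space, as in "BZ -ZT", is not medial when step 1 runs, so it survives step 2 and still distinguishes the name.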

    This is a logical statement: you can do the transformations in a single pass
    if you are careful, and you also can do the comparison while transforming
    incrementally.
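As a rough illustration of comparing while transforming, the sketch below yields the transformed characters of each name lazily and stops at the first difference. The names `loose_chars` and `loose_match_incremental` are hypothetical. "Medial" is again assumed to mean a run of hyphens with an ASCII letter or digit directly on each side, and the U+1180 exception is deliberately left out to keep the example short, so a real implementation would need to add it.

```python
from itertools import zip_longest


def _is_ascii_alnum(ch: str) -> bool:
    return ch.isascii() and ch.isalnum()


def loose_chars(name: str):
    """Yield the loosely transformed characters of name in a single pass.

    Assumption: the U+1180 exception is NOT handled here.
    """
    n = len(name)
    i = 0
    while i < n:
        ch = name[i]
        if ch == "-":
            # Find the end of this run of hyphens.
            j = i
            while j < n and name[j] == "-":
                j += 1
            medial = (
                i > 0 and j < n
                and _is_ascii_alnum(name[i - 1])
                and _is_ascii_alnum(name[j])
            )
            if not medial:
                # Non-medial hyphens are kept as they are.
                for _ in range(j - i):
                    yield "-"
            i = j
            continue
        if not (ch.isspace() or ch == "_"):
            yield ch.lower()
        i += 1


def loose_match_incremental(a: str, b: str) -> bool:
    """Compare two names character by character, stopping at the first mismatch."""
    missing = object()  # marks the shorter stream running out first
    pairs = zip_longest(loose_chars(a), loose_chars(b), fillvalue=missing)
    return all(x == y for x, y in pairs)


if __name__ == "__main__":
    print(loose_match_incremental("CHARACTER BZZT", "character b-zz-t"))
```

Because `all` short-circuits, neither name is transformed beyond the first position where the two streams differ.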

    Mark

    On Thu, Mar 11, 2010 at 09:34, karl williamson <public@khwilliamson.com> wrote:

    > Kenneth Whistler wrote:
    >
    >>>>> The loose matching rules in TR18 say to ignore white space, underscores,
    >>>>> and hyphens. That means that someone could insert white space into the
    >>>>> middle of what is supposed to be a single word, like
    >>>>> \p{s c r i p t: greek}. Same for character names.
    >>>>>
    >>>> Actually, it doesn't mean that you can arbitrarily ignore
    >>>> the identifier syntax of particular formalizations.
    >>>>
    >>> I don't understand your sentence. I'm guessing you mean that
    >>> 's c r i p t' is not the same as 'script', even though tr18 says "case
    >>> distinctions, whitespace, hyphens, and underbar are ignored." If so,
    >>> shouldn't tr18 be clarified?
    >>>
    >>
    >> I should have said "pattern syntax" rather than "identifier syntax"
    >> in this case, but the point is that while UTS #18 makes
    >> a general statement about how pattern matching for property
    >> names and values should be done, you still have to pay attention
    >> to the details of the actual implementations.
    >>
    >> Without checking an actual implementation of java.util.regex Class
    >> Pattern, I don't know whether:
    >>
    >> \p{_________ -------s c r i p________--_- t ___: greek}
    >>
    >> would actually match the Unicode Script property or would
    >> throw a PatternSyntaxException.
    >>
    >> You can try it and find out, I suppose. But that isn't
    >> really so much an issue for UTS #18 but rather something to take
    >> up with the implementers of Java, Perl, and other regex
    >> engines.
    >>
    >>
    > The reason I'm asking this is that I am an implementer of Perl's regex
    > engine. I didn't realize that that fact would be germane to my question, so
    > I didn't mention it. Sorry. I'm not interested in what's advisable or not
    > to use; I'm interested in what the engine should accept versus throw an
    > exception on, and hence how I need to write the engine. So I am seeking
    > clarification of what TUS would like from an implementation.
    >
    > In the past Perl has not accepted the full loose matching rules, but now I
    > have implemented what I thought were them for the soon-to-be-released Perl
    > 5.12. Perl 5 is an open-source project; I am a volunteer with some
    > background and interest in the topic, but not an expert. I am, however, an
    > expert software developer, retired now, so I have some time to devote to
    > this.
    >
    > Based on my reading of TR18 and UAX44, I changed the Perl regex engine so
    > it would parse things like what Ken mentioned above:
    >
    > \p{_________ -------s c r i p________--_- t ___: greek}
    > as meaning \p{script:greek}, without throwing an exception. Again, it's
    > not advisable for someone to write something like that, but it appears to me
    > to be permissible, and so I wrote the regex engine to handle it.
    >
    > I am starting out to add loose matching to the regex engine for character
    > names for the next release of Perl 5 (and I anticipate adding support for
    > named sequences in Perl by then, so for them as well).
    >
    > Effectively, it was pointed out that my reading of what I thought was the
    > plain wording of the standard might be wrong, since, if there can be a space
    > between any two characters, the concept of word is meaningless, and
    > therefore the concept of a medial hyphen is as well. Conversely, if words
    > can be run-on together, all hyphens (except at the very beginning and end of
    > the string) become medial, and so the distinction is also meaningless.
    >
    >
    >>>> What it means is that such names as:
    >>>>
    >>>> CHARACTER BZZT
    >>>> CHARACTER B-ZZ-T
    >>>> CHARACTER BZ-ZT
    >>>>
    >>> What about
    >>> CHARACER BZ--ZT
    >>> ?
    >>>
    >>
    >> What about it?
    >>
    >> "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
    >> the first one is missing the "T" in "CHARACTER". But then,
    >> I don't suppose that was your question.
    >>
    >
    > Sorry for the typo, and thanks for figuring out what I really meant.
    >
    >
    >> The loose matching rules would not distinguish:
    >>
    >> CHARACTER BZZT
    >>
    >> from
    >>
    >> CHARACTER BZ--ZT
    >>
    >> or for that matter, from
    >>
    >> CHARACTER BZ---------------------------------------------------ZT
    >>
    >> But if your question is, rather, would "CHARACTER BZ--ZT" be
    >> allowed as a Unicode character name, the answer is no.
    >> But the reason for that cannot be found in UTS #18. The reason
    >> is because it would be stupid and pointless to name a character that way,
    >> and the folks in the relevant maintenance committees are not
    >> stupid.
    >>
    >
    > Of course
    >
    >
    >> In general, if there is something unclear about matching rules
    >> in the Unicode Standard, a more fruitful direction would be to
    >> examine the relevant text in the proposed update for UAX #44
    >> and suggest any required clarifications to the UTC, if there
    >> really is an issue of ambiguity in that text. See:
    >>
    >> http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
    >>
    >> --Ken
    >>
    >>
    > Implementers need highly precise wording in a standard. So this sentence
    > in the current UAX44 draft (thanks for the link) is problematic for me:
    >
    > UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    > hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
    >
    > If whitespace is ignored, then all hyphens are medial, and as tr18 points
    > out, there would then be two other confusable cases, involving what you
    > might think of as "initial" hyphens.
    >
    > So, I'm in a hurry. I don't have time to wait for the next draft of UAX44.
    > Perl 5.12 is in a code freeze. If I misread what you guys intended, it
    > would be good if I knew immediately, so I could go and plead that the
    > revisions I would have to write be allowed in so that the defective version
    > would never get published.
    >
    > My sense, though, is that I didn't misread it, that the statements made in
    > UAX34 and 44 are imprecise, and based on your responses to this email, I
    > will submit an official report through your website.
    >
    >>
    >>
    >>
    >
