Re: property, character, and sequence name loose matching

From: karl williamson
Date: Thu Mar 11 2010 - 11:34:01 CST


    Kenneth Whistler wrote:
    >>>> The loose matching rules in TR18 say to ignore white space, underscores,
    >>>> and hyphens. That means that someone could insert white space into the
    >>>> middle of what is supposed to be a single word, like
    >>>> \p{s c r i p t: greek}. Same for character names.
    >>> Actually, it doesn't mean that you can arbitrarily ignore
    >>> the identifier syntax of particular formalizations.
    >> I don't understand your sentence. I'm guessing you mean that
    >> 's c r i p t' is not the same as 'script', even though tr18 says "case
    >> distinctions, whitespace, hyphens, and underbar are ignored." If so,
    >> shouldn't tr18 be clarified?
    > I should have said "pattern syntax" rather than "identifier syntax"
    > in this case, but the point is that while UTS #18 makes
    > a general statement about how pattern matching for property
    > names and values should be done, you still have to pay attention
    > to the details of the actual implementations.
    > Without checking an actual implementation of java.util.regex Class
    > Pattern, I don't know whether:
    > \p{_________ -------s c r i p________--_- t ___: greek}
    > would actually match the Unicode Script property or would
    > throw a PatternSyntaxException.
    > You can try it and find out, I suppose. But that isn't
    > really so much an issue for UTS #18 but rather something to take
    > up with the implementers of Java, Perl, and other regex
    > engines.

    The reason I'm asking this is that I am an implementer of Perl's regex
    engine. I didn't realize that that fact would be germane to my
    question, so I didn't mention it. Sorry. I'm not interested in what's
    advisable or not to use; I'm interested in what the engine should accept
    versus throw an exception on, and hence how I need to write the engine.
      So I am seeking clarification of what TUS would like from an

    In the past Perl has not accepted the full loose matching rules, but I
    have now implemented what I understood them to be for the soon-to-be-released
    Perl 5.12. Perl 5 is an open-source project; I am a volunteer with some
    background and interest in the topic, but not an expert. I am, however,
    an expert software developer, retired now, so I have some time to devote
    to this.

    Based on my reading of TR18 and UAX44, I changed the Perl regex engine
    so it would parse things like what Ken mentioned above:
    \p{_________ -------s c r i p________--_- t ___: greek}
    as meaning \p{script:greek}, without throwing an exception. Again, it's
    not advisable for someone to write something like that, but it appears
    to me to be permissible, and so I wrote the regex engine to handle it.
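
To make the reading described above concrete, here is a minimal Python sketch (not Perl's actual implementation) of a parser that applies the "ignore case, whitespace, hyphens, and underbar" wording literally to the body of a \p{...} expression. The helper names loose_key and parse_property are hypothetical, and splitting on the first colon is an assumption about the pattern syntax.

```python
import re

def loose_key(text):
    """Hypothetical helper: drop all whitespace, underscores, and hyphens, then fold case."""
    return re.sub(r"[\s_\-]+", "", text).lower()

def parse_property(body):
    """Split the body of \\p{...} on the first colon and loosely normalize both halves."""
    name, _, value = body.partition(":")
    return loose_key(name), loose_key(value)

body = "_________ -------s c r i p________--_- t ___: greek"
print(parse_property(body))  # ('script', 'greek')
```

Under this literal reading the odd-looking pattern and \p{script:greek} produce the same lookup key, so no exception is thrown.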

    I am starting to add loose matching to the regex engine for
    character names for the next release of Perl 5 (and I anticipate adding
    support for named sequences in Perl by then, so for them as well).

    Effectively, it was pointed out that my reading of what I thought was
    the plain wording of the standard might be wrong, since, if there can be
    a space between any two characters, the concept of word is meaningless,
    and therefore the concept of a medial hyphen is as well. Conversely, if
    words can be run-on together, all hyphens (except at the very beginning
    and end of the string) become medial, and so the distinction is also

    >>> What it means is that such names as:
    >> What about
    >> ?
    > What about it?
    > "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
    > the first one is missing the "T" in "CHARACTER". But then,
    > I don't suppose that was your question.

    Sorry for the typo, and thanks for figuring out what I really meant.
    > The loose matching rules would not distinguish:
    > from
    > or for that matter, from
    > CHARACTER BZ---------------------------------------------------ZT
    > But if your question is, rather, would "CHARACTER BZ--ZT" be
    > allowed as a Unicode character name, the answer is no.
    > But the reason for that cannot be found in UTS #18. The reason
    > is because it would be stupid and pointless to name a character that way,
    > and the folks in the relevant maintenance committees are not
    > stupid.

    Of course.
    > In general, if there is something unclear about matching rules
    > in the Unicode Standard, a more fruitful direction would be to
    > examine the relevant text in the proposed update for UAX #44
    > and suggest any required clarifications to the UTC, if there
    > really is an issue of ambiguity in that text. See:
    > --Ken

    Implementers need highly precise wording in a standard. So this
    sentence in the current UAX44 draft (thanks for the link) is problematic
    for me:

    UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
    hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

    If whitespace is ignored, then all hyphens are medial, and as tr18
    points out, there would then be two other confusable cases, involving
    what you might think of as "initial" hyphens.
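
The ambiguity can be sketched in Python by writing out two possible readings of "medial hyphen". Both functions are hypothetical illustrations, not text from UAX44, and the U+1180 exception is deliberately left out to keep the sketch short. The example names are the real character names TIBETAN LETTER A (U+0F68) and TIBETAN LETTER -A (U+0F60), which I am assuming is one of the confusable pairs in question.

```python
import re

def reading_a(name):
    """Reading A: remove whitespace and underscores first; then every hyphen
    that is not the first or last character counts as medial and is dropped."""
    folded = re.sub(r"[\s_]+", "", name).upper()
    return re.sub(r"(?<=.)-(?=.)", "", folded)

def reading_b(name):
    """Reading B: a hyphen is medial only if it sits directly between two
    letters or digits in the original string; drop those, then fold."""
    no_medial = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    return re.sub(r"[\s_]+", "", no_medial).upper()

for fn in (reading_a, reading_b):
    same = fn("TIBETAN LETTER -A") == fn("TIBETAN LETTER A")
    print(fn.__name__, "treats the two names as equal:", same)
```

Reading A collapses the two names into one key, while reading B keeps them apart, which is why the order of the steps and the definition of "medial" need to be stated precisely.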

    So, I'm in a hurry. I don't have time to wait for the next draft of
    UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
    intended, it would be good if I knew immediately, so I could go and
    plead that the revisions I would have to write be allowed in so that the
    defective version would never get published.

    My sense, though, is that I didn't misread it, that the statements made
    in UAX34 and 44 are imprecise, and based on your responses to this
    email, I will submit an official report through your website.
