Re: property, character, and sequence name loose matching

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 10 2010 - 14:22:03 CST

    > >> The loose matching rules in TR18 say to ignore white space, underscores,
    > >> and hyphens. That means that someone could insert white space into the
    > >> middle of what is supposed to be a single word, like
    > >> \p{s c r i p t: greek}. Same for character names.
    > >
    > > Actually, it doesn't mean that you can arbitrarily ignore
    > > the identifier syntax of particular formalizations.
    > I don't understand your sentence. I'm guessing you mean that
    > 's c r i p t' is not the same as 'script', even though tr18 says "case
    > distinctions, whitespace, hyphens, and underbar are ignored." If so,
    > shouldn't tr18 be clarified?

    I should have said "pattern syntax" rather than "identifier syntax"
    in this case, but the point is that while UTS #18 makes
    a general statement about how pattern matching for property
    names and values should be done, you still have to pay attention
    to the details of the actual implementations.

    Without testing an actual implementation of java.util.regex.Pattern,
    I don't know whether:

    \p{_________ -------s c r i p________--_- t ___: greek}

    would actually match the Unicode Script property or would
    throw a PatternSyntaxException.

    You can try it and find out, I suppose. But that is not so much
    an issue for UTS #18 as something to take up with the implementers
    of Java, Perl, and other regex engines.
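
    One way to picture the distinction between pattern syntax and loose
    matching: an engine first has to parse the \p{...} escape under its
    own grammar, and only then can it loose-match the name and value it
    extracted. The Python sketch below is purely illustrative. The two
    grammars and the names PERMISSIVE_ESCAPE, STRICT_ESCAPE, loose_key,
    and parse_property_escape are all hypothetical; none of this is
    claimed to be the behavior of java.util.regex, Perl, or any other
    real engine.

```python
import re

# Hypothetical grammars for a \p{name:value} or \p{name=value} escape.
# PERMISSIVE accepts nearly anything between the braces; STRICT insists on
# word characters only. Both are assumptions made for illustration.
PERMISSIVE_ESCAPE = re.compile(r"\\p\{([^:=}]+)[:=]([^}]+)\}")
STRICT_ESCAPE = re.compile(r"\\p\{(\w+)[:=](\w+)\}")


def loose_key(text: str) -> str:
    """Ignore case, whitespace, underscores, and hyphens, per the rule quoted above."""
    return re.sub(r"[\s_\-]+", "", text).casefold()


def parse_property_escape(pattern: str, strict: bool = False) -> tuple[str, str]:
    """Stage 1: parse with the engine's own syntax. Stage 2: loose-match the parts."""
    grammar = STRICT_ESCAPE if strict else PERMISSIVE_ESCAPE
    m = grammar.fullmatch(pattern)
    if m is None:
        # The escape never reaches loose matching; the syntax stage rejected it.
        raise ValueError(f"not a valid property escape under this grammar: {pattern!r}")
    return loose_key(m.group(1)), loose_key(m.group(2))
```

    Under the permissive grammar, parse_property_escape(r"\p{s c r i p t: greek}")
    yields the same pair as r"\p{Script=Greek}"; with strict=True the
    spaced-out form is rejected before loose matching is ever consulted.
    Which of the two a given engine resembles is a question for its
    implementers.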

    > > What it means is that such names as:
    > >
    > > CHARACTER BZZT
    > > CHARACTER B-ZZ-T
    > > CHARACTER BZ-ZT
    >
    > What about
    > CHARACER BZ--ZT
    > ?

    What about it?

    "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
    the first one is missing the "T" in "CHARACTER". But then,
    I don't suppose that was your question.

    The loose matching rules would not distinguish:

    CHARACTER BZZT

    from

    CHARACTER BZ--ZT

    or for that matter, from

    CHARACTER BZ---------------------------------------------------ZT

    But if your question is, rather, would "CHARACTER BZ--ZT" be
    allowed as a Unicode character name, the answer is no.
    But the reason for that cannot be found in UTS #18. The reason
    is that it would be stupid and pointless to name a character that way,
    and the folks in the relevant maintenance committees are not
    stupid.
    and the folks in the relevant maintenance committees are not
    stupid.
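
    The name comparisons above can be sketched in a few lines of Python.
    This is a minimal sketch of the simplified rule as quoted in this
    thread (ignore case, whitespace, hyphens, and underscores). It does
    not attempt to reproduce the precise matching rules in UAX #44, so
    any special cases defined there are not handled. The function names
    loose_name_key and loose_match are hypothetical.

```python
def loose_name_key(name: str) -> str:
    """Reduce a name to a comparison key: drop whitespace, '_' and '-', fold case.

    Simplified rule as quoted in the thread; not the full UAX #44 rule.
    """
    kept = (ch for ch in name if not (ch.isspace() or ch in "_-"))
    return "".join(kept).casefold()


def loose_match(a: str, b: str) -> bool:
    """Two names loose-match when their comparison keys are identical."""
    return loose_name_key(a) == loose_name_key(b)


if __name__ == "__main__":
    reference = "CHARACTER BZZT"
    candidates = [
        "CHARACTER B-ZZ-T",
        "CHARACTER BZ--ZT",
        "CHARACTER BZ" + "-" * 51 + "ZT",
        "CHARACER BZ--ZT",  # missing the "T" in "CHARACTER"
    ]
    for candidate in candidates:
        print(f"{candidate!r}: {loose_match(reference, candidate)}")
```

    Note that loose matching only says which strings compare equal. It
    says nothing about which strings would be acceptable as actual
    character names.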

    In general, if there is something unclear about matching rules
    in the Unicode Standard, a more fruitful direction would be to
    examine the relevant text in the proposed update for UAX #44
    and suggest any required clarifications to the UTC, if there
    really is an issue of ambiguity in that text. See:

    http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules

    --Ken


