From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 10 2010 - 14:22:03 CST
> >> The loose matching rules in TR18 say to ignore white space, underscores,
> >> and hyphens. That means that someone could insert white space into the
> >> middle of what is supposed to be a single word, like
> >> \p{s c r i p t: greek}. Same for character names.
> >
> > Actually, it doesn't mean that you can arbitrarily ignore
> > the identifier syntax of particular formalizations.
> I don't understand your sentence. I'm guessing you mean that
> 's c r i p t' is not the same as 'script', even though tr18 says "case
> distinctions, whitespace, hyphens, and underbar are ignored." If so,
> shouldn't tr18 be clarified?
I should have said "pattern syntax" rather than "identifier syntax"
in this case, but the point is that while UTS #18 makes
a general statement about how pattern matching for property
names and values should be done, you still have to pay attention
to the details of the actual implementations.
Without checking an actual implementation of the java.util.regex.Pattern
class, I don't know whether:
\p{_________ -------s c r i p________--_- t ___: greek}
would actually match the Unicode Script property or would
throw a PatternSyntaxException.
You can try it and find out, I suppose. But that is less an
issue for UTS #18 than something to take up with the
implementers of Java, Perl, and other regex engines.
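
To make the distinction between pattern syntax and loose matching concrete, here is a minimal sketch in Python. It is not how java.util.regex or any real engine works; the names KNOWN_PROPERTIES, loose_key, and resolve_property are hypothetical, and the strict_syntax flag merely stands in for whatever an implementation's pattern parser happens to accept inside the braces.

```python
import re

# Hypothetical property table for illustration; a real engine has many more.
KNOWN_PROPERTIES = {"script", "block", "generalcategory"}

def loose_key(name: str) -> str:
    """Fold case and drop whitespace, underscores, and hyphens."""
    return re.sub(r"[\s_\-]", "", name).lower()

def resolve_property(raw_name: str, strict_syntax: bool) -> str:
    """Resolve the text found before the ':' in a \\p{name: value} pattern.

    strict_syntax models an engine whose pattern parser only accepts a
    plain identifier there, so loose matching never sees anything else.
    """
    if strict_syntax and not re.fullmatch(r"[A-Za-z_]+", raw_name):
        raise ValueError(f"pattern syntax error near {raw_name!r}")
    key = loose_key(raw_name)
    if key not in KNOWN_PROPERTIES:
        raise ValueError(f"unknown property {raw_name!r}")
    return key

odd_name = "_________ -------s c r i p________--_- t ___"
print(resolve_property(odd_name, strict_syntax=False))
```

Under these assumptions the lenient variant resolves the odd spelling to the same key as "script", while the strict variant rejects it before loose matching is ever applied. Which of the two a given engine resembles is a question for that engine's implementers.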
> > What it means is that such names as:
> >
> > CHARACTER BZZT
> > CHARACTER B-ZZ-T
> > CHARACTER BZ-ZT
>
> What about
> CHARACER BZ--ZT
> ?
What about it?
"CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
the first one is missing the "T" in "CHARACTER". But then,
I don't suppose that was your question.
The loose matching rules would not distinguish:
CHARACTER BZZT
from
CHARACTER BZ--ZT
or for that matter, from
CHARACTER BZ---------------------------------------------------ZT
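
As a rough illustration of the comparisons above, here is a small Python sketch. It follows only the simplified wording quoted in this thread (ignore case, whitespace, underscores, and hyphens); the helper names loose_name_key and loose_equal are made up for the example, and the exact rules should be taken from the UAX #44 text rather than from this code.

```python
import re

def loose_name_key(name: str) -> str:
    """Simplified loose-matching key: drop whitespace, underscores,
    and hyphens, then fold case. An assumption-laden sketch only."""
    return re.sub(r"[\s_\-]", "", name).upper()

def loose_equal(a: str, b: str) -> bool:
    """True when two names are indistinguishable under the sketch above."""
    return loose_name_key(a) == loose_name_key(b)

print(loose_equal("CHARACTER BZZT", "CHARACTER BZ--ZT"))
print(loose_equal("CHARACTER BZZT", "CHARACTER BZ" + "-" * 51 + "ZT"))
print(loose_equal("CHARACER BZ--ZT", "CHARACTER BZZT"))
```

The first two comparisons collapse to the same key because only hyphens differ; the third does not, because the misspelled name is missing a letter that no amount of hyphen-stripping restores.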
But if your question is, rather, would "CHARACTER BZ--ZT" be
allowed as a Unicode character name, the answer is no.
But the reason for that cannot be found in UTS #18. The reason
is that it would be stupid and pointless to name a character that way,
and the folks in the relevant maintenance committees are not
stupid.
In general, if there is something unclear about matching rules
in the Unicode Standard, a more fruitful direction would be to
examine the relevant text in the proposed update for UAX #44
and suggest any required clarifications to the UTC, if there
really is an issue of ambiguity in that text. See:
http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
--Ken