Re: property, character, and sequence name loose matching

From: karl williamson (public@khwilliamson.com)
Date: Thu Mar 11 2010 - 11:34:01 CST

Next message: philip chastney: "Fw: Re: ß vs. ſs"

Previous message: Mark Davis ☕: "Re: property, character, and sequence name loose matching"
In reply to: Kenneth Whistler: "Re: property, character, and sequence name loose matching"
Next in thread: Mark Davis ☕: "Re: property, character, and sequence name loose matching"
Reply: Mark Davis ☕: "Re: property, character, and sequence name loose matching"
Reply: Asmus Freytag: "Re: property, character, and sequence name loose matching"

Kenneth Whistler wrote:
>>>> The loose matching rules in TR18 say to ignore white space, underscores,
>>>> and hyphens. That means that someone could insert white space into the
>>>> middle of what is supposed to be a single word, like
>>>> \p{s c r i p t: greek}. Same for character names.
>>> Actually, it doesn't mean that you can arbitrarily ignore
>>> the identifier syntax of particular formalizations.
>> I don't understand your sentence. I'm guessing you mean that
>> 's c r i p t' is not the same as 'script', even though tr18 says "case
>> distinctions, whitespace, hyphens, and underbar are ignored." If so,
>> shouldn't tr18 be clarified?
>
> I should have said "pattern syntax" rather than "identifier syntax"
> in this case, but the point is that while UTS #18 makes
> a general statement about how pattern matching for property
> names and values should be done, you still have to pay attention
> to the details of the actual implementations.
>
> Without checking an actual implementation of java.util.regex Class
> Pattern, I don't know whether:
>
> \p{_________ -------s c r i p________--_- t ___: greek}
>
> would actually match the Unicode Script property or would
> throw a PatternSyntaxException.
>
> You can try it and find out, I suppose. But that isn't
> really so much an issue for UTS #18 but rather something to take
> up with the implementers of Java, Perl, and other regex
> engines.
>

The reason I'm asking this is that I am an implementer of Perl's regex
engine. I didn't realize that that fact would be germane to my
question, so I didn't mention it. Sorry. I'm not interested in what's
advisable or not to use; I'm interested in what the engine should accept
versus throw an exception on, and hence how I need to write the engine.
So I am seeking clarification of what TUS would like from an
implementation.

In the past Perl has not accepted the full loose matching rules, but I
have now implemented them, as I understood them, for the soon-to-be-released
Perl 5.12. Perl 5 is an open-source project; I am a volunteer with some
background and interest in the topic, but not an expert. I am, however,
an expert software developer, retired now, so I have some time to devote
to this.

Based on my reading of TR18 and UAX44, I changed the Perl regex engine
so it would parse things like what Ken mentioned above:
\p{_________ -------s c r i p________--_- t ___: greek}
as meaning \p{script:greek}, without throwing an exception. Again, it's
not advisable for someone to write something like that, but it appears
to me to be permissible, and so I wrote the regex engine to handle it.
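
To make that reading concrete, here is a minimal Python sketch of it. This is
not Perl's actual implementation; the function names loose_key and
parse_property are hypothetical, and the sketch assumes that "ignore" means
deleting every whitespace character, underscore, and hyphen before comparing.

```python
import re

def loose_key(text: str) -> str:
    """Delete whitespace, underscores, and hyphens, then fold case.

    Assumption: this is the literal reading of "case distinctions,
    whitespace, hyphens, and underbar are ignored".
    """
    return re.sub(r"[\s_\-]+", "", text).lower()

def parse_property(body: str) -> tuple[str, str]:
    """Split the inside of \\p{...} at the first ':' and normalize both sides."""
    name, _, value = body.partition(":")
    return loose_key(name), loose_key(value)

print(parse_property("_________ -------s c r i p________--_- t ___: greek"))
# prints ('script', 'greek')
```

Under this reading the pattern body above and "script:greek" produce the
same pair, so the engine treats them as the same property test.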

I am now starting to add loose matching to the regex engine for
character names for the next release of Perl 5 (and I anticipate adding
support for named sequences in Perl by then, so for them as well).

Effectively, it was pointed out that my reading of what I thought was
the plain wording of the standard might be wrong, since, if there can be
a space between any two characters, the concept of word is meaningless,
and therefore the concept of a medial hyphen is as well. Conversely, if
words can be run-on together, all hyphens (except at the very beginning
and end of the string) become medial, and so the distinction is also
meaningless.

>>> What it means is that such names as:
>>>
>>> CHARACTER BZZT
>>> CHARACTER B-ZZ-T
>>> CHARACTER BZ-ZT
>> What about
>> CHARACER BZ--ZT
>> ?
>
> What about it?
>
> "CHARACER BZ--ZT" won't loose match "CHARACTER BZZT", because
> the first one is missing the "T" in "CHARACTER". But then,
> I don't suppose that was your question.

Sorry for the typo, and thanks for figuring out what I really meant.
>
> The loose matching rules would not distinguish:
>
> CHARACTER BZZT
>
> from
>
> CHARACTER BZ--ZT
>
> or for that matter, from
>
> CHARACTER BZ---------------------------------------------------ZT
>
> But if your question is, rather, would "CHARACTER BZ--ZT" be
> allowed as a Unicode character name, the answer is no.
> But the reason for that cannot be found in UTS #18. The reason
> is because it would be stupid and pointless to name a character that way,
> and the folks in the relevant maintenance committees are not
> stupid.

Of course.
>
> In general, if there is something unclear about matching rules
> in the Unicode Standard, a more fruitful direction would be to
> examine the relevant text in the proposed update for UAX #44
> and suggest any required clarifications to the UTC, if there
> really is an issue of ambiguity in that text. See:
>
> http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules
>
> --Ken
>

Implementers need highly precise wording in a standard. So this
sentence in the current UAX44 draft (thanks for the link) is problematic
for me:

UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.

If whitespace is ignored, then all hyphens are medial, and as tr18
points out, there would then be two other confusable cases, involving
what you might think of as "initial" hyphens.
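
The ambiguity can be shown with a small Python sketch. Both functions below
are hypothetical readings of UAX44-LM2, not anything the standard specifies,
and both leave out the U+1180 exception for brevity. The first deletes
whitespace before deciding which hyphens are medial; the second treats a
hyphen as medial only when it sits directly between two letters or digits in
the original string.

```python
import re

def lm2_spaces_first(name: str) -> str:
    """Reading 1: delete whitespace and underscores first, so every hyphen
    that is not the very first or last character counts as medial."""
    s = re.sub(r"[\s_]+", "", name).upper()
    return re.sub(r"(?<=.)-(?=.)", "", s)

def lm2_words_first(name: str) -> str:
    """Reading 2: a hyphen is medial only between two letters/digits of
    the original string; whitespace and underscores are deleted afterwards."""
    s = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    return re.sub(r"[\s_]+", "", s).upper()

a = "TIBETAN LETTER A"    # U+0F68
b = "TIBETAN LETTER -A"   # U+0F60

print(lm2_spaces_first(a) == lm2_spaces_first(b))  # True: the two names collide
print(lm2_words_first(a) == lm2_words_first(b))    # False: the names stay distinct
```

Under the first reading the hyphen in TIBETAN LETTER -A becomes medial once
the preceding space is gone, which is the confusable case at issue. Under the
second reading the names stay distinct, but then a hyphen's status depends on
whitespace that the same rule says to ignore.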

So, I'm in a hurry. I don't have time to wait for the next draft of
UAX44. Perl 5.12 is in a code freeze. If I misread what you guys
intended, it would be good if I knew immediately, so I could go and
plead that the revisions I would have to write be allowed in so that the
defective version would never get published.

My sense, though, is that I didn't misread it, that the statements made
in UAX34 and 44 are imprecise, and based on your responses to this
email, I will submit an official report through your website.

