Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Sun Sep 23 2007 - 15:47:29 CDT

In reply to: Mark Davis: "Re: New Public Review Issue: Proposed Update UTS #18"

>> I think it is wrong to think of [^set] as being some 'universe' minus
>> [set]. The way I think of it is that [^set] matches anywhere [set]
>> does not match. As a simple example, consider the expression:
>>
>> /^[\q{ch}].*/ # text must start with 'ch'
>>
>> This will match the input strings "churro" or "chimichanga", but won't
>> match "caliente."
>>
>> Now if we negate the set, we have the expression:
>>
>> /^[^\q{ch}].*/ # text must not start with 'ch'
>>
>> Then the matching behavior is just the opposite: "caliente" matches,
>> while "churro" and "chimichanga" do not. In my opinion, this is what
>> an end user would expect.
>
> The difficulty is masked by your use of .* afterwards.
>
> Take /[\q{ch}]/. It matches all strings consisting of "ch". By your
> logic, /[^\q{ch}]/ matches all strings that are not "ch", including, as
> I said, "New York", and "onomatopoeic", and this entire email.

No, that is not my logic. /[^\q{ch}]/ matches all *characters* that
are not "ch". Whether this should mean "match the next code point"
or a whole grapheme cluster is debatable.

Here is a repeat of my example without the .*:

/[\q{ch}]/ # match "ch" as a single character

This will match "cucaracha" starting at the 7th code point, consisting
of the 7th and 8th code points.

The negated set would be:

/[^\q{ch}]/ # don't match the character "ch"

This pattern will match "cucaracha" at any position except the 7th
code point. So repeatedly applying the match operation would return:
"c" "u" "c" "a" "r" "a" "h" "a". I think it would be even better if
the "h" were not returned, since it is part of the excluded "ch".
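
To make the cucaracha example concrete, here is a minimal sketch using
Python's re module as a stand-in. Python has no \q{...} syntax, so the
patterns below are my own approximations: "ch" stands for [\q{ch}],
"(?!ch)." stands for [^\q{ch}] read as "not at the start of ch, then
one code point", and the last variant is one possible way to keep the
"h" from being returned.

```python
import re

text = "cucaracha"

# Approximation of [\q{ch}]: the string "ch" treated as one set member.
m = re.search(r"ch", text)
ch_span = (m.start() + 1, m.end())  # 1-based, inclusive code point positions

# Approximation of [^\q{ch}]: refuse to match where "ch" starts,
# otherwise advance by one code point.
negated = re.findall(r"(?!ch).", text)

# Assumed variant: consume "ch" as a unit and discard it, so that the
# trailing "h" is never offered as a match on its own.
negated_no_h = [
    t.group(0) for t in re.finditer(r"ch|.", text) if t.group(0) != "ch"
]

print(ch_span)
print(negated)
print(negated_no_h)
```

The second list is the sequence given above; the third drops the "h".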

> I think a clearer way of thinking about it is that [a-z \q{ch} \q{rr}]
> is equivalent to ( [a-z] | ch | rr ) [actually to (?:[a-z]|ch|rr), but
> let's forget about capturing for the moment to make things simpler.]
> Then the question is what the 'inverse' of ( [a-z] | ch | rr ) is
> supposed to be equivalent to. There are a variety of possibilities:
>
> 1. [^a-z] -- fail with strings starting with a-z and otherwise
> advance by one code point
> 2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
> starting with a-z, ch, or rr, and otherwise advance by one code point
> 3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
> ch, or rr, and otherwise advance by grapheme cluster
> 4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
> strings starting with a-z, ch, or rr, and otherwise advance by
> tailored grapheme cluster (for traditional Spanish, would include
> ch, ll, rr, and thus allow "ll")
> 5. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}]* -- fail with strings
> starting with a-z, ch, or rr, and otherwise advance by any amount
> 6. (?! ([a-z] | ch | rr) $) [\x{0}-\x{10FFFF}]* -- fail with strings
> exactly matching a-z, ch, or rr, and otherwise advance by any amount
> 7. illegal -- you can't use ^ with sets containing strings.
>
> #1 is the current approach in UTS18. #5 and #6 are the ones I was
> against. They clearly wouldn't work; they would screw up any use of
> existing ranges in Regex. #7 disallows the use of user-perceived
> characters like x+acute, although it might be a good choice for the
> non-grapheme-cluster-recognizing mode. #4 only works with
> language-sensitive modes, which are somewhat tenuous. #2 and #3 are
> possibilities.

I have been arguing for #2 or #3 all along. The problem with #1 is
that it only achieves the correct result if the first letter of each
grapheme cluster is also in the set, which won't always be the case.
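
A rough sketch of the difference between #1 and #2, again with Python's
re module standing in for a UTS #18 engine. The two sets, the ANY
pattern, and the helper first() are assumptions chosen for
illustration: set A is the one from the list above, and set B is a
made-up set [aeiou \q{ch}] in which "c" is not itself a member. #3 is
left out because it depends on \X.

```python
import re

ANY = r"[\s\S]"  # stand-in for [\x{0}-\x{10FFFF}]: any one code point

# Set A: [a-z \q{ch} \q{rr}] -- the first letter of every string is in a-z.
a_opt1 = re.compile(r"[^a-z]")                    # option 1: strings ignored
a_opt2 = re.compile(r"(?![a-z]|ch|rr)" + ANY)     # option 2: lookahead

# Set B (hypothetical): [aeiou \q{ch}] -- "c" is not in the set by itself.
b_opt1 = re.compile(r"[^aeiou]")
b_opt2 = re.compile(r"(?![aeiou]|ch)" + ANY)


def first(pattern, text):
    """Return what the pattern matches at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


# Set A: both options refuse "churro", so #1 happens to give the right answer.
print(first(a_opt1, "churro"), first(a_opt2, "churro"))

# Set B: option 1 accepts the "c" of "churro"; option 2 still refuses it.
print(first(b_opt1, "churro"), first(b_opt2, "churro"))

# Both options accept the "c" of "caliente", which does not start with "ch".
print(first(b_opt1, "caliente"), first(b_opt2, "caliente"))
```

With set B, option #1 matches the "c" of "churro" even though the text
starts with the excluded "ch", which is the failure described above.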

> Note also that the UTC is proposing a somewhat more inclusive grapheme
> cluster than the default, one that is still language-neutral. The
> proposed update to UAX #31 will be going up soon.

UAX #31 is for Identifier and Pattern Syntax; did you mean UAX #29
Text Boundaries?

Mike
