RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy ([email protected])
Date: Tue Sep 25 2007 - 15:56:16 CDT

Next message: Philippe Verdy: "RE: Public Review Issues update: UAX #31"

Previous message: Rick McGowan: "Public Review Issues update: UAX #29 and UAX #31"
In reply to: Mark Davis: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"

Mark Davis wrote:
> On 9/23/07, Mike <[email protected]> wrote:
> >> As far as your other comments (copied below), the issue is as to what
> > [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
> >> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
> > "ñ", "ch", "ll", "rr"}.
> >> • The set inversion would be the set of all other strings. So that would
> > include "0", "A", ... but also "New York", and "onomatopoeic", and so on.
> > An infinite set.
>
> Why do you assume such huge extension of the input universe ?
>
> The only needed thing is that the inversion set has to be universe minus the
> positive set, and that /./ has to include all possible positive sets, in
> such a way that {/[set]/, /[^set]/} is an exact partition of the universe of
> acceptable input units.

> I think it is wrong to think of [^set] as being some 'universe' minus
> [set].  The way I think of it is that [^set] matches anywhere [set]
> does not match.  As a simple example, consider the expression:
>
>       /^[\q{ch}].*/      # text must start with 'ch'
>
> This will match the input strings "churro" or "chimichanga", but won't
> match "caliente."
>
> Now if we negate the set, we have the expression:
>
>       /^[^\q{ch}].*/     # text must not start with 'ch'
>
> Then the matching behavior is just the opposite: "caliente" matches,
> while "churro" and "chimichanga" do not.  In my opinion, this is what
> an end user would expect.
> The difficulty is masked by your use of .* afterwards.

That's true but...

> Take /[\q{ch}]/. It matches all strings consisting of "ch".
> By your logic, /[^\q{ch}]/ matches all strings that are not "ch",
> including, as I said, "New York", and "onomatopoeic", and this entire email.

This assumption is false. You are assuming that "." matches everything. In
my opinion it only matches what the user sees as a single character in his
input universe. So it will most probably match the "N" in "New York", but not
the whole string.

My opinion is that what is relevant is the set of collation elements in the
user's input alphabet associated with his locale (or that locale extended by
the inclusion, at the end of its alphabet if they are absent from it, of the
additional collation elements introduced in the regexp itself by "\q{...}").

Here the explicit use of "\q{ch}" adds the collation element {ch} to the
default alphabet, at the end of the "." universe, so "." becomes
(?:[\x{0}-\x{10FFFF}]|\q{ch}) and is sorted so that "ch" sorts after all
single code points, unless the locale defines an explicit sort order for this
collation element (in which case "." is just equivalent to the complete
tailored list of Unicode collation elements, including all existing single
code points). The sort order of the input universe is relevant for defining a
clear meaning of ranges in set notations like [a-z]: without it, you don't
know which collation elements this range includes or does not include.
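
To make the idea concrete, here is a minimal sketch in Python of an ordered
input universe, of a range resolved against its sort order, and of a negated
set computed as "universe minus positive set". The function names and the toy
alphabets are assumptions chosen for illustration; they are not part of
UTS #18 or of any existing regexp engine.

```python
def build_universe(base, extra):
    """Ordered universe: the base alphabet, then any \\q{...} elements
    that the base alphabet does not already contain, appended at the end."""
    universe = list(base)
    for element in extra:
        if element not in universe:
            universe.append(element)
    return universe


def expand_range(universe, first, last):
    """All collation elements from first to last, in universe order."""
    lo, hi = universe.index(first), universe.index(last)
    return universe[lo:hi + 1]


def complement(universe, positive):
    """Elements of the universe that are not in the positive set."""
    return [e for e in universe if e not in positive]


# Toy default alphabet (assumption): digits, a-z, then "ñ".
base = list("0123456789") + list("abcdefghijklmnopqrstuvwxyz") + ["ñ"]
universe = build_universe(base, ["ch", "ll", "rr"])

# [a-z ñ \q{ch} \q{ll} \q{rr}] and its negation partition the universe.
positive = expand_range(universe, "a", "z") + ["ñ", "ch", "ll", "rr"]
negative = complement(universe, positive)

# Toy tailored order (assumption) where "ch" sorts between "c" and "d".
tailored = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")
default_a_to_d = expand_range(universe, "a", "d")   # does not contain "ch"
tailored_a_to_d = expand_range(tailored, "a", "d")  # contains "ch"
```

With the untailored order the range [a-d] does not contain "ch", while with
the tailored order it does, which is why the sort order has to be known
before a range can be given a meaning.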

Note that I make a distinction between "\q{ch}" and "ch": the first
explicitly defines an unbreakable collation element, the second one does not
and is then interpreted as "\q{c}\q{h}" (unless the user's locale already
interprets it as a single collation element). This does not change anything
in regexps, except in the interpretation of "." and of the elements specified
within "[set]" and "[^set]". So the regexps /ch/ and /\q{ch}/ match the same
thing, the former being more efficient than the latter because it does
not alter the input universe.
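
As an illustration of the distinction, the sketch below reads an input string
as a sequence of collation elements, preferring the longest multi-character
element known at each position. The function name tokenize and the
longest-match rule are assumptions made for this example; a real
implementation would take its elements from the locale's collation data.

```python
def tokenize(text, multi_elements):
    """Split text into collation elements. Multi-character elements in
    multi_elements are unbreakable; everything else is read one code
    point at a time."""
    # Try longer elements first; ignore empty strings to guarantee progress.
    candidates = sorted((e for e in multi_elements if e), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for element in candidates:
            if text.startswith(element, i):
                tokens.append(element)
                i += len(element)
                break
        else:
            # No multi-character element starts here: take a single code point.
            tokens.append(text[i])
            i += 1
    return tokens


plain = tokenize("churro", set())              # no \q{...} in effect
with_ch = tokenize("churro", {"ch"})           # \q{ch} added to the universe
with_ch_rr = tokenize("churro", {"ch", "rr"})  # \q{ch} and \q{rr} added
```

In the first reading "ch" is simply "c" followed by "h"; in the other two it
is a single element that "." or a set can match as one unit.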

The effect of \q{...} on the "." universe could be limited to only a
specific part of the regexp, for example if the regexp defines an explicit
locale context, which is the only context affected: in

/(?locale=br![a-z\q{rr}])a.s/

the addition of the "rr" collation element to the Breton input universe
would not affect the interpretation of "." in this regexp outside of the
delimited Breton locale... But it would affect the meaning of "." in:

/(?locale=br![a-z\q{rr}]a.)s/

where the "charrs" input text would match (it matches as if it were read
as "\q{ch}\q{a}\q{rr}\q{s}", where "\q{rr}" in the input is an accepted
instance of "." in the regexp), but not in the former regexp (because there
the input would be read as "\q{ch}\q{a}\q{r}\q{r}\q{s}", and the second
occurrence of "\q{r}" would not match "s" in the regexp).

With the former regexp, you would find a match only with "hars", read as
"\q{h}\q{a}\q{r}\q{s}", but not with "harrs", read as
"\q{h}\q{a}\q{r}\q{r}\q{s}" (because the \q{rr} collation element present in
the former regexp is not significant outside of the explicit Breton locale
context it alters).
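
The two readings can be simulated with a small sketch. Everything in it is an
assumption made for illustration: the (?locale=br!...) syntax is only the
hypothetical notation used above, the Breton universe is reduced to the single
extra element "ch", the universe outside the locale context is taken to have
no multi-character elements, and the match is anchored at both ends. Each
pattern step carries the set of multi-character elements in scope at that
point of the regexp.

```python
BRETON = {"ch"}            # assumption: toy Breton universe
INSIDE = BRETON | {"rr"}   # universe inside the context, extended by \q{rr}
OUTSIDE = set()            # assumption: no multi-character elements outside


def next_element(text, i, multi):
    """Longest in-scope collation element starting at position i."""
    for element in sorted(multi, key=len, reverse=True):
        if element and text.startswith(element, i):
            return element
    return text[i]


def full_match(pattern, text):
    """pattern is a list of (accepts, in_scope_elements) steps."""
    i = 0
    for accepts, multi in pattern:
        if i >= len(text):
            return False
        element = next_element(text, i, multi)
        if not accepts(element):
            return False
        i += len(element)
    return i == len(text)


def in_set(element):
    # [a-z\q{rr}] read in the Breton context: a-z, plus "ch" and "rr".
    return element in INSIDE or (len(element) == 1 and "a" <= element <= "z")


def literal(expected):
    return lambda element: element == expected


def dot(element):
    return True


# /(?locale=br![a-z\q{rr}])a.s/ : only the set is inside the locale context.
former = [(in_set, INSIDE), (literal("a"), OUTSIDE),
          (dot, OUTSIDE), (literal("s"), OUTSIDE)]
# /(?locale=br![a-z\q{rr}]a.)s/ : the set, "a" and "." are all inside it.
latter = [(in_set, INSIDE), (literal("a"), INSIDE),
          (dot, INSIDE), (literal("s"), OUTSIDE)]

results = {
    ("former", "charrs"): full_match(former, "charrs"),
    ("latter", "charrs"): full_match(latter, "charrs"),
    ("former", "hars"): full_match(former, "hars"),
    ("former", "harrs"): full_match(former, "harrs"),
}
```

Under these assumptions "charrs" is accepted only by the second regexp, and
the first one accepts "hars" but not "harrs".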

More generally, my view is that NO implementation should ever make a
difference between [set] and (s|e|t). They should match the same thing,
ideally with the same performance. Both regexps should "compile" into
exactly the same parsing graph. The fact that one results in a lookup
from a bitset or binary lookup table and the other does not is not relevant
for users, nor for me: it's an implementation defect if it affects the
performance of your regexp matcher, and a severe bug if they don't match
exactly the same things.

For me, the main use of the "[set]" and "[^set]" notation, not supported
by the notation of alternatives with "|", is the possibility of including
ranges of collation elements like [a-z] without having to list all the
alternatives it means, i.e. "(a|b|...|z)" here (an abbreviated regexp, which
would not work with this exact syntax). But even in this case, both notations
could produce exactly the same thing:

The set notation should first be scanned by splitting it at default
grapheme cluster boundaries (in fact the whole regexp should be scanned like
this), and then parsed to recognize the special meaning of "[", "^", "-",
"]" and "\q{...}" used in the set notation; all other default grapheme
clusters are then interpreted as being part of the input universe.
The "\q{...}" will automatically affect the current locale context by
extending the "." universe (if needed, because these collation elements may
already be present in that locale).
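
A rough sketch of such a parser follows. It is only an approximation under
stated assumptions: Python's standard library has no grapheme cluster
segmentation, so single code points stand in for default grapheme clusters;
whitespace and escapes other than "\q{...}" are not handled; and the name
parse_set and its return value are invented for this example.

```python
import re

# Either a \q{...} element or one code point (stand-in for a grapheme cluster).
TOKEN = re.compile(r"\\q\{([^}]*)\}|(.)", re.DOTALL)


def parse_set(notation, universe):
    """Parse "[...]" or "[^...]" into (negated, members, extended_universe)."""
    if not (notation.startswith("[") and notation.endswith("]")):
        raise ValueError("not a set notation")
    body = notation[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]

    tokens = [
        ("q", m.group(1)) if m.group(1) is not None else ("char", m.group(2))
        for m in TOKEN.finditer(body)
    ]

    # \q{...} elements missing from the universe are appended at its end.
    extended = list(universe)
    for kind, value in tokens:
        if kind == "q" and value not in extended:
            extended.append(value)

    members = []
    i = 0
    while i < len(tokens):
        value = tokens[i][1]
        # A literal "-" between two elements denotes a range in universe order.
        if i + 2 < len(tokens) and tokens[i + 1] == ("char", "-"):
            lo = extended.index(value)
            hi = extended.index(tokens[i + 2][1])
            members.extend(extended[lo:hi + 1])
            i += 3
        else:
            members.append(value)
            i += 1
    return negated, members, extended


alphabet = list("abcdefghijklmnopqrstuvwxyz") + ["ñ"]
negated, members, extended = parse_set(r"[^a-c\q{ch}]", alphabet)
# What the negated set accepts: the extended universe minus the members.
accepted = [e for e in extended if e not in members]
```

The members list could then be turned into the same alternatives that
"(a|b|c|ch)" would give, with the negated form accepting the rest of the
extended universe.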

This archive was generated by hypermail 2.1.5 : Tue Sep 25 2007 - 15:57:41 CDT