RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 15:56:16 CDT

    Mark Davis wrote:
    > On 9/23/07, Mike <mike-list@pobox.com> wrote:
    > >> As far as your other comments (copied below), the issue is as to what
    > > [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
    > >> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
    > > "ñ", "ch", "ll", "rr"}.
    > >> • The set inversion would be the set of all other strings. So that
    > > would include "0", "A", ... but also "New York", and "onomotopaeic",
    > > and so on. An infinite set.
    >
    > Why do you assume such huge extension of the input universe ?
    >
    > The only needed thing is that the inversion set has to be universe minus
    > the positive set, and that /./ has to include all possible positive sets,
    > in such a way that {/[set]/, /[^set]/} is an exact partition of the
    > universe of acceptable input units.

    > I think it is wrong to think of [^set] as being some 'universe' minus
    > [set].  The way I think of it is that [^set] matches anywhere [set]
    > does not match.  As a simple example, consider the expression:
    >
    >       /^[\q{ch}].*/      # text must start with 'ch'
    >
    > This will match the input strings "churro" or "chimichanga", but won't
    > match "caliente."
    >
    > Now if we negate the set, we have the expression:
    >
    >       /^[^\q{ch}].*/     # text must not start with 'ch'
    >
    > Then the matching behavior is just the opposite: "caliente" matches,
    > while "churro" and "chimichanga" do not.  In my opinion, this is what
    > an end user would expect.
    > The difficulty is masked by your use of .* afterwards.

    That's true but...

    > Take /[\q{ch}]/. It matches all strings consisting of "ch".
    > By your logic, /[^\q{ch}]/ matches all strings that are not "ch",
    > including, as I said, "New York", and "onomotopaeic", and this entire
    > email.

    This assumption is false. You are assuming that "." matches everything. In
    my opinion it only matches what the user sees as a single character in their
    input universe. So it will most probably match the "N" in "New York" but not
    the whole string.

    My opinion is that what is relevant is the set of collation elements in the
    user's input alphabet associated with their locale (or that locale extended,
    at the end of its alphabet, with any additional collation elements that the
    regexp itself introduces with "\q{...}" and that are absent from it).

    Here the explicit use of "\q{ch}" adds the collation element {ch} to the
    default alphabet, at the end of the "." universe, so "." becomes
    (?:[\x{0}-\x{10FFFF}]|\q{ch}) and is sorted so that "ch" sorts after all
    single code points, unless the locale defines an explicit sort order for
    this collation element (in which case "." is just equivalent to the complete
    tailored list of Unicode collation elements, including all existing single
    code points). The sort order of the input universe is relevant for defining
    a clear meaning of ranges in set notations like [a-z]: without it, you don't
    know which collation elements this range includes or does not include.
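
    As a rough illustration of why the ordering matters, here is a minimal
    Python sketch. It assumes the universe can be modelled as an ordered list
    of collation elements; the helper names build_universe and expand_range,
    the reduction of the alphabet to a-z, and the "tailored" ordering that
    places "ch" between "c" and "d" are all hypothetical and only serve the
    example.

```python
import string


def build_universe(base, extra):
    """Return the ordered universe: the base alphabet, followed by any
    extra collation elements that are not already part of it."""
    universe = list(base)
    for elem in extra:
        if elem not in universe:
            universe.append(elem)  # unknown elements go at the very end
    return universe


def expand_range(universe, first, last):
    """Return the elements from first to last (inclusive), in universe order."""
    i, j = universe.index(first), universe.index(last)
    return universe[i:j + 1]


# Simplified default alphabet (assumption: only a-z, not all code points).
default = list(string.ascii_lowercase)

# No tailoring: the element introduced by \q{ch} is appended after "z".
untailored = build_universe(default, ["ch"])

# Hypothetical tailoring in which "ch" already sorts between "c" and "d".
tailored = build_universe(default[:3] + ["ch"] + default[3:], ["ch"])

range_untailored = expand_range(untailored, "a", "z")  # does not contain "ch"
range_tailored = expand_range(tailored, "a", "z")      # contains "ch"
```

    With the untailored universe, [a-z] covers 26 elements and "ch" falls
    outside the range; with the hypothetical tailoring, the same range
    covers 27 elements and includes "ch".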

    Note that I make a distinction between "\q{ch}" and "ch": the first
    explicitly defines an unbreakable collation element; the second does not,
    and is then interpreted as "\q{c}\q{h}" (unless the user's locale already
    interprets it as a single collation element). This does not change anything
    in regexps, except in the interpretation of "." and of elements specified
    within "[set]" and "[^set]". So the regexps /ch/ and /\q{ch}/ match the same
    thing, the former being more efficient than the latter because it does not
    alter the input universe.

    The effect of \q{...} on the "." universe could be limited to only a
    specific part of the regexp, for example if the regexp defines an explicit
    locale context, which is the only context affected: in

             /(?locale=br![a-z\q{rr}])a.s/

    the addition of the "rr" collation element to the Breton input universe
    would not affect the interpretation of "." in this regexp outside of the
    delimited Breton locale... But it would affect the meaning of "." in:

            /(?locale=br![a-z\q{rr}]a.)s/

    where the "charrs" input text would match (it is read as
    "\q{ch}\q{a}\q{rr}\q{s}", where "\q{rr}" in the input is an accepted
    instance of "." in the regexp). It would not match the former regexp,
    because there the input would be read as "\q{ch}\q{a}\q{r}\q{r}\q{s}" and
    the second occurrence of "\q{r}" would not match "s" in the regexp. With
    the former regexp, you would find a match only with "hars", read as
    "\q{h}\q{a}\q{r}\q{s}", but not with "harrs", read as
    "\q{h}\q{a}\q{r}\q{r}\q{s}", because the \q{rr} collation element present in
    the former regexp is not significant outside of the explicit Breton locale
    context it alters.
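
    The two readings of "charrs" can be sketched in Python as follows. This
    is only an approximation under stated assumptions: the whole input is
    segmented with a single universe (the scoping described above would
    switch universes inside the regexp), the Breton tailoring is reduced to
    the single multi-character element "ch", and the names segment,
    matches_at and found are hypothetical helpers, not part of any regexp
    engine.

```python
import string


def segment(text, elements):
    """Greedy longest-match split of text into collation elements.
    Anything not listed in elements falls back to a single character."""
    multi = sorted((e for e in elements if len(e) > 1), key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for e in multi:
            if text.startswith(e, i):
                out.append(e)
                i += len(e)
                break
        else:  # no multi-character element starts here
            out.append(text[i])
            i += 1
    return out


def matches_at(tokens, start, pattern):
    """pattern is a list of predicates, each consuming exactly one element."""
    window = tokens[start:start + len(pattern)]
    return len(window) == len(pattern) and all(
        pred(tok) for pred, tok in zip(pattern, window))


def found(tokens, pattern):
    """True if pattern matches at some position of the token list."""
    return any(matches_at(tokens, i, pattern) for i in range(len(tokens)))


breton = ["ch"]  # assumption: the only multi-character element considered
in_set = set(string.ascii_lowercase) | {"ch", "rr"}  # stands for [a-z\q{rr}]
pattern = [
    lambda t: t in in_set,  # [a-z\q{rr}]
    lambda t: t == "a",     # a
    lambda t: True,         # "." : any single collation element
    lambda t: t == "s",     # s
]

with_rr = segment("charrs", breton + ["rr"])  # "rr" is part of the universe
without_rr = segment("charrs", breton)        # "rr" is not part of it

match_with_rr = found(with_rr, pattern)
match_without_rr = found(without_rr, pattern)
```

    When "rr" belongs to the universe, the input splits into four elements
    and "." consumes "rr"; when it does not, the input splits into five
    elements and no position satisfies the four-element pattern.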

    More generally, my view is that NO implementation should ever make a
    difference between [set] and (s|e|t). They should match the same thing,
    ideally with the same performance. Both regexps should "compile" into
    exactly the same parsing graph. Whether one results in a lookup from a
    bitset or binary lookup table and the other does not is not relevant for
    users, nor for me: it is an implementation defect if it affects the
    performance of your regexp matcher, and a severe bug if the two don't match
    exactly the same things.
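
    For single code points, this equivalence can be checked with an existing
    engine. The sketch below uses Python's standard re module and compares
    match positions on a few arbitrary sample strings; the negated form
    written with a negative lookahead is my own illustration of how [^set]
    could be spelled with alternatives, not something any specification
    prescribes.

```python
import re

char_class = re.compile(r"[set]")
alternation = re.compile(r"(?:s|e|t)")

# Negated set versus "any single character that is not s, e or t".
# DOTALL is needed because [^set] also matches a newline.
negated_class = re.compile(r"[^set]")
negated_alternation = re.compile(r"(?!s|e|t).", re.DOTALL)

samples = ["set", "tests", "xyz", "", "best effort", "line\nbreak"]


def spans(regex, text):
    """All match positions of regex in text, as (start, end) pairs."""
    return [m.span() for m in regex.finditer(text)]


same_positive = all(
    spans(char_class, s) == spans(alternation, s) for s in samples)
same_negative = all(
    spans(negated_class, s) == spans(negated_alternation, s) for s in samples)
```

    Both comparisons hold for these samples, which is the behaviour argued
    for above; whether the two forms compile to the same internal graph is
    an implementation detail this check does not reveal.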

    For me, the main use of the "[set]" and "[^set]" notation, not supported
    by the notation of alternatives with "|", is the possibility of including
    ranges of collation elements like [a-z] without having to list all the
    alternatives it stands for, i.e. "(a|b|...|z)" here (an abbreviated regexp
    that would not work with this exact syntax). But even in this case, both
    could produce exactly the same thing:

    The set notation should first be scanned by splitting it at default
    grapheme cluster boundaries (in fact the whole regexp should be scanned like
    this), and then parsed to recognize the special meaning of "[", "^", "-",
    "]" and "\q{...}" used in the set notation; all other default grapheme
    clusters are then interpreted as being part of the input universe.
    The "\q{...}" will automatically affect the current locale context by
    extending the "." universe (if needed, because these collation elements may
    already be present in that locale).
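
    A minimal Python sketch of such a parse, turning a set notation into
    the equivalent list of alternatives, might look like this. It makes
    several simplifying assumptions: it scans by code point rather than by
    default grapheme cluster (the standard library has no grapheme
    segmenter), it expands ranges in code point order rather than in a
    locale's collation order, and the names parse_set and to_alternation
    are hypothetical.

```python
import re

# One token is either \q{...}, a "-" or any other single character.
TOKEN = re.compile(r"\\q\{([^}]*)\}|(-)|(.)", re.DOTALL)


def parse_set(notation):
    """Parse '[...]' or '[^...]' into (negated, list of elements)."""
    if not (notation.startswith("[") and notation.endswith("]")):
        raise ValueError("expected a set notation of the form [...]")
    body = notation[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    elements, pending_range = [], False
    for m in TOKEN.finditer(body):
        quoted, dash, single = m.groups()
        if dash and elements and not pending_range:
            pending_range = True  # "-" after an element opens a range
            continue
        item = quoted if quoted is not None else (dash or single)
        if pending_range:
            start = elements.pop()
            if len(start) != 1 or len(item) != 1:
                raise ValueError("range endpoints must be single characters")
            # Assumption: code point order, not a tailored collation order.
            elements.extend(chr(c) for c in range(ord(start), ord(item) + 1))
            pending_range = False
        else:
            elements.append(item)
    if pending_range:
        elements.append("-")  # a trailing "-" is taken literally
    return negated, elements


def to_alternation(elements):
    """Build (?:x|y|...) with longer elements first, so that a
    multi-character element wins over its own first character."""
    ordered = sorted(set(elements), key=lambda e: (-len(e), e))
    return "(?:" + "|".join(re.escape(e) for e in ordered) + ")"


negated, elements = parse_set(r"[a-z\q{rr}]")
pattern = re.compile(to_alternation(elements))
```

    Here the set [a-z\q{rr}] becomes an alternation of 27 elements in which
    "rr" is tried before "r", so the input "rra" yields "rr" as its first
    match while "ra" yields "r".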


