Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (
Date: Sun Sep 23 2007 - 11:06:31 CDT

    >> As far as your other comments (copied below), the issue is as to what
    >> [^a-z \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
    >> The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
    >> "", "ch", "ll", "rr"}.
    >> The set inversion would be the set of all other strings. So that would
    >> include "0", "A", ... but also "New York", and "onomotopaeic", and so on. An
    >> infinite set.
    > Why do you assume such huge extension of the input universe ?
    > The only needed thing is that the inversion set has to be universe minus the
    > positive set, and that /./ has to include all possible positive sets, in
    > such a way that {/[set]/, /[^set]/} is an exact partition of the universe of
    > acceptable input units.

    I think it is wrong to think of [^set] as being some 'universe' minus
    [set]. The way I think of it is that [^set] matches anywhere [set]
    does not match. As a simple example, consider the expression:

           /^[\q{ch}].*/ # text must start with 'ch'

    This will match the input strings "churro" or "chimichanga", but won't
    match "caliente".

    Now if we negate the set, we have the expression:

           /^[^\q{ch}].*/ # text must not start with 'ch'

    Then the matching behavior is just the opposite: "caliente" matches,
    while "churro" and "chimichanga" do not. In my opinion, this is what
    an end user would expect.
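
    For anyone who wants to try this behavior out, here is a rough sketch
    in Python. Python's re module has no \q{...} syntax, so this is only an
    emulation: [\q{ch}] is written as the group (?:ch), and [^\q{ch}] as a
    negative lookahead followed by one consumed character. That last choice,
    and the names below, are my assumptions for illustration, not anything
    UTS #18 specifies.

```python
import re

# Emulation only: Python's re has no \q{...}, so the string "ch" is spelled
# as a group, and the negated set as "not followed by ch, then one character".
STARTS_WITH_CH = re.compile(r"^(?:ch).*")        # stands in for /^[\q{ch}].*/
NOT_STARTS_WITH_CH = re.compile(r"^(?!ch)..*")   # stands in for /^[^\q{ch}].*/


def starts_with_ch(text):
    """True when the positive pattern matches the text."""
    return STARTS_WITH_CH.match(text) is not None


def not_starts_with_ch(text):
    """True when the negated pattern matches the text."""
    return NOT_STARTS_WITH_CH.match(text) is not None


if __name__ == "__main__":
    for word in ("churro", "chimichanga", "caliente"):
        print(word, starts_with_ch(word), not_starts_with_ch(word))
```

    Under this reading, every non-empty input matches exactly one of the two
    patterns, which is the "just the opposite" behavior described above.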

    > You are not required to include in /./ all codepoints in the UCS, you may
    > restrict /./ to include only assigned and valid characters....

    One problem with restricting . to match only assigned characters is
    that text containing characters assigned in a later version of Unicode
    than the one the implementation knows about will produce false
    negatives. In my implementation, \a matches only assigned characters
    and \A matches unassigned ones, and you can select which version of
    Unicode to use with, for example, \v{4.1}.
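
    The version problem is easy to reproduce with Python's unicodedata
    module, which ships both a current character database and a frozen
    Unicode 3.2.0 one (unicodedata.ucd_3_2_0). The sketch below is not my
    implementation of \a or \v{...}; is_assigned is a hypothetical helper,
    and 3.2.0 just stands in for "an older version". It treats a character
    as unassigned when its general category is Cn.

```python
import unicodedata


def is_assigned(ch, ucd=unicodedata):
    """True if ch is not in general category Cn in the given database.

    Cn covers unassigned code points and noncharacters. The default database
    is whatever Unicode version this Python build ships with.
    """
    return ucd.category(ch) != "Cn"


OLD = unicodedata.ucd_3_2_0   # frozen Unicode 3.2.0 database
RUPEE = "\u20b9"              # INDIAN RUPEE SIGN, added in Unicode 6.0

if __name__ == "__main__":
    print("current database:", unicodedata.unidata_version)
    print("old database:    ", OLD.unidata_version)
    for ch in ("A", RUPEE):
        print(f"U+{ord(ch):04X}", is_assigned(ch), is_assigned(ch, OLD))
```

    A matcher pinned to the old database would refuse to match the rupee
    sign with an "assigned only" dot, even though the text is valid under
    a newer version of Unicode.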

