RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (
Date: Mon Oct 01 2007 - 22:22:35 CST

  • Next message: Dmitry Turin: "Re[3]: Hardwarily formating colour and size of font (3 new symbols)"

    And my approach is to leave the freedom between the #2, #3 and #4 solutions
    (I support advancing one tailored element, the set of unitary elements
    depending on the input universe being considered,which may be
    locale-sensitive). But I also support #1 in the classic C locale (more
    exactly, this is not necessarily one code point in #1 and #2, it may be one
    code unit as well, as this depends also of the input universe is what “.”
    can match (ignoring here the special case of line terminators whose effect
    on the input often depends on an option flag).


    But for any Regexps that supports some “\q{}” grouping syntax at least
    within sets, #1 is certainly the worst option.


    The problem with #2 is that when advancing one code point, it may cross a
    cluster boundary implied in negated sets like [^\N{C WITH CEDILLA AND
    ACCENT}] where the character referenced by name has several canonical
    equivalents with variable lengths (in terms of number of code points matched
    which varies here from 1 to 3, with 5 possible encodings, and possibly more
    if there are additional combining characters after these sequences, not
    necessarily paired or ordered in the canonical order).


    This leaves another difficulty for the actual meaning of an expression like
    [\x0-\x10FFFF] which includes many characters with multiple decompositions:
    if more than one code point can be matched by a regexp like [é] which
    contains a canonically decomposable character, then

    * are all the possible decompositions included in the set in addition
    to single code points in the range?
    * How to interpret [à-é] if it depends on the normalization form
    (let’s ignore here the question of locales which affects the ordering)?
    * what is the meaning of ranges like this in terms of relative
    ordering and what the minimum and maximum bounds are including or excluding,
    and how to order \q{rr} itself if it’s one of the range bounds?


    Many efforts have been invested by Unicode to define things like collation
    and case-mappings, but all this effort is currently left unused in the
    current UTS; this means that it matches no correct locale, but only a very
    basic locale (not even the Basic default Unicode locale as it does not seam
    to consider at least the DUCET). If you want to work only at the codepoint
    level, then the exposed Regexp will work reliably only in the basic “C”
    locale (or its minor “POSIX” variant), so all other character properties
    should be used according to that basic locale (excluding then advanced case
    mappings, and all multilevel collation orders and the DUCET).


    So I hope that the proposal, in the current state, will not discard/forbid
    further advanced features using a more complete set of Unicode properties
    and the support of locales, and that space will be left for at least
    experimentation without violating the current rules exposed, in away that
    would make the existing Regexps incompatible (in their interpretation) with
    the future evolutions.



    De : [] De la
    part de Mark Davis
    Envoyé : mardi 2 octobre 2007 03:32
    À : Mike; Andy Heninger
    Cc : Unicode
    Objet : Proposal for matching negated sets (was Re: New Public Review Issue:
    Proposed Update UTS #18)


    Getting back to this topic, given the upcoming meeting we should prepare
    some kind of summary and submit it. (Nothing on this list is seen by the UTC
    unless someone writes up a proposal or submits feedback.)

    Here is my attempt (grabbing some text from an email of Sep 23). Please let
    me know if you have any feedback before I send it in.

    There are a few viable approaches to matching negated sets. Let's take [a-z
    \q{ch} \q{rr}] as an example. It is productive to look at the "unrolled
    version of this, using the fact that the following are equivalent (ignoring
    capturing for the moment):

    [a-z \q{ch} \q{rr}]
    ( [a-z] | ch | rr )

    Then the question amounts to what the 'inverse' of ( [a-z] | ch | rr ) is
    supposed to be equivalent to. Here are some possibilities:

    1. [^a-z] -- fail with strings starting with a-z and otherwise advance
    by one code point
    2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
    starting with a-z, ch, or rr, and otherwise advance by one code point
    3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z, ch,
    or rr, and otherwise advance by grapheme cluster
    4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
    strings starting with a-z, ch, or rr, and otherwise advance by one tailored
    grapheme cluster (for traditional spanish, would include ch, ll, rr, and