Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Oct 01 2007 - 19:32:04 CST

  • Next message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

    Getting back to this topic, given the upcoming meeting we should prepare
    some kind of summary and submit it. (Nothing on this list is seen by the UTC
    unless someone writes up a proposal or submits feedback.)

    Here is my attempt (grabbing some text from an email of Sep 23). Please let
    me know if you have any feedback before I send it in.

    There are a few viable approaches to matching negated sets. Let's take [a-z
    \q{ch} \q{rr}] as an example. It is productive to look at the "unrolled
    version of this, using the fact that the following are equivalent (ignoring
    capturing for the moment):

    [a-z \q{ch} \q{rr}]
    and
    ( [a-z] | ch | rr )

    Then the question amounts to what the 'inverse' of ( [a-z] | ch | rr )
    issupposed to be equivalent to. Here are some possibilities:

       1. [^a-z] -- fail with strings starting with a-z and otherwise advance
       by one code point
       2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
       starting with a-z, ch, or rr, and otherwise advance by one code point
       3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
       ch, or rr, and otherwise advance by grapheme cluster
       4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
       strings starting with a-z, ch, or rr, and otherwise advance by one tailored
       grapheme cluster (for traditional spanish, would include ch, ll, rr, and
       thus allow "ll" as a match).

    Comments:
    #1 is the current approach.
    #2 seems more intuitive, and is worth proposing as an alternative
    #3 would only be compatible in a mode where [^a-z] also matched at a
    grapheme level. So it couldn't be the default. In such a mode, however, it
    would be the natural extension of #2.
    #4 is a possibility, but only for locale-sensitive regex (which are
    uncommon). In such a mode, it would be the natural extension of #3.

    The proposal is to have #2 as a new recommended default approach in an
    proposed updated to UTS#18, for public review and comment. #3 and #4 could
    be included also for the appropriate modes, although they are less
    important.

    -- 
    Mark
    


    This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 19:35:37 CST