Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Hans Aberg (haberg@math.su.se)
Date: Mon Oct 08 2007 - 09:17:35 CDT

  • Next message: Marion Gunn: "Re: Emoticons (was: Root and fraction (2 new symbols))"

    Taking the (language) complement of a RE (regular expression) was
    discussed here:
    http://groups.google.com/group/comp.compilers/browse_thread/thread/
    8daad029575ed4e2/f6748b708e46af5f?lnk=st&q=&rnum=16#f6748b708e46af5f

    The post by Karsten Nyblad describes one way to do it, by converting
    the RE to a DFA, finding the complement DFA, and then converting back
    to a RE. Perhaps does not generate nice-looking REs.

       Hans Åberg

    On 2 Oct 2007, at 03:32, Mark Davis wrote:

    > Getting back to this topic, given the upcoming meeting we should
    > prepare some kind of summary and submit it. (Nothing on this list
    > is seen by the UTC unless someone writes up a proposal or submits
    > feedback.)
    >
    > Here is my attempt (grabbing some text from an email of Sep 23).
    > Please let me know if you have any feedback before I send it in.
    >
    > There are a few viable approaches to matching negated sets. Let's
    > take [a-z \q{ch} \q{rr}] as an example. It is productive to look at
    > the "unrolled version of this, using the fact that the following
    > are equivalent (ignoring capturing for the moment):
    >
    > [a-z \q{ch} \q{rr}]
    > and
    > ( [a-z] | ch | rr )
    >
    > Then the question amounts to what the 'inverse' of ( [a-z] | ch |
    > rr ) is supposed to be equivalent to. Here are some possibilities:
    > [^a-z] -- fail with strings starting with a-z and otherwise advance
    > by one code point
    > (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
    > starting with a-z, ch, or rr, and otherwise advance by one code point
    > (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
    > ch, or rr, and otherwise advance by grapheme cluster
    > (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
    > strings starting with a-z, ch, or rr, and otherwise advance by one
    > tailored grapheme cluster (for traditional spanish, would include
    > ch, ll, rr, and thus allow "ll" as a match).
    > Comments:
    > #1 is the current approach.
    > #2 seems more intuitive, and is worth proposing as an alternative
    > #3 would only be compatible in a mode where [^a-z] also matched at
    > a grapheme level. So it couldn't be the default. In such a mode,
    > however, it would be the natural extension of #2.
    > #4 is a possibility, but only for locale-sensitive regex (which are
    > uncommon). In such a mode, it would be the natural extension of #3.
    >
    > The proposal is to have #2 as a new recommended default approach in
    > an proposed updated to UTS#18, for public review and comment. #3
    > and #4 could be included also for the appropriate modes, although
    > they are less important.
    >
    > --
    > Mark



    This archive was generated by hypermail 2.1.5 : Mon Oct 08 2007 - 09:21:29 CDT