# Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Hans Aberg (haberg@math.su.se)
Date: Mon Oct 08 2007 - 09:17:35 CDT

• Next message: Marion Gunn: "Re: Emoticons (was: Root and fraction (2 new symbols))"

Taking the (language) complement of a RE (regular expression) was
discussed here:

The post by Karsten Nyblad describes one way to do it, by converting
the RE to a DFA, finding the complement DFA, and then converting back
to a RE. Perhaps does not generate nice-looking REs.

Hans Åberg

On 2 Oct 2007, at 03:32, Mark Davis wrote:

> Getting back to this topic, given the upcoming meeting we should
> prepare some kind of summary and submit it. (Nothing on this list
> is seen by the UTC unless someone writes up a proposal or submits
> feedback.)
>
> Here is my attempt (grabbing some text from an email of Sep 23).
> Please let me know if you have any feedback before I send it in.
>
> There are a few viable approaches to matching negated sets. Let's
> take [a-z \q{ch} \q{rr}] as an example. It is productive to look at
> the "unrolled version of this, using the fact that the following
> are equivalent (ignoring capturing for the moment):
>
> [a-z \q{ch} \q{rr}]
> and
> ( [a-z] | ch | rr )
>
> Then the question amounts to what the 'inverse' of ( [a-z] | ch |
> rr ) is supposed to be equivalent to. Here are some possibilities:
> [^a-z] -- fail with strings starting with a-z and otherwise advance
> by one code point
> (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
> starting with a-z, ch, or rr, and otherwise advance by one code point
> (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
> ch, or rr, and otherwise advance by grapheme cluster
> (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
> strings starting with a-z, ch, or rr, and otherwise advance by one
> tailored grapheme cluster (for traditional spanish, would include
> ch, ll, rr, and thus allow "ll" as a match).
> #1 is the current approach.
> #2 seems more intuitive, and is worth proposing as an alternative
> #3 would only be compatible in a mode where [^a-z] also matched at
> a grapheme level. So it couldn't be the default. In such a mode,
> however, it would be the natural extension of #2.
> #4 is a possibility, but only for locale-sensitive regex (which are
> uncommon). In such a mode, it would be the natural extension of #3.
>
> The proposal is to have #2 as a new recommended default approach in
> an proposed updated to UTS#18, for public review and comment. #3
> and #4 could be included also for the appropriate modes, although
> they are less important.
>
> --
> Mark

This archive was generated by hypermail 2.1.5 : Mon Oct 08 2007 - 09:21:29 CDT