Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Oct 01 2007 - 19:32:04 CST

Next message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

Previous message: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Philippe Verdy: "RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Philippe Verdy: "RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Maybe reply: Mike: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"

Getting back to this topic, given the upcoming meeting we should prepare
some kind of summary and submit it. (Nothing on this list is seen by the UTC
unless someone writes up a proposal or submits feedback.)

Here is my attempt (grabbing some text from an email of Sep 23). Please let
me know if you have any feedback before I send it in.

There are a few viable approaches to matching negated sets. Let's take
[a-z \q{ch} \q{rr}] as an example. It is productive to look at the "unrolled"
version of this, using the fact that the following are equivalent (ignoring
capturing for the moment):

[a-z \q{ch} \q{rr}]
and
( [a-z] | ch | rr )
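
As a rough illustration of the unrolled form, here is a minimal Python
sketch. Python's re module has no \q{...} syntax, so the alternation is
written out by hand. The name in_set and the choice to list the strings
before the character class are assumptions made for this sketch, not
something the proposal specifies.

```python
import re

# Hand-unrolled form of [a-z \q{ch} \q{rr}].
# Assumption: the multi-character strings are listed first, so that an
# ordered-alternation engine tries "ch" before the single letter "c".
UNROLLED = re.compile(r"(?:ch|rr|[a-z])")

def in_set(s):
    """Return True if the whole string s is one element of the set."""
    return UNROLLED.fullmatch(s) is not None

if __name__ == "__main__":
    for sample in ["q", "ch", "rr", "ll", "Q"]:
        print(sample, in_set(sample))
```

Under these assumptions, "q", "ch" and "rr" are elements of the set, while
"ll" and "Q" are not.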

Then the question amounts to what the 'inverse' of ( [a-z] | ch | rr )
is supposed to be equivalent to. Here are some possibilities:

   1. [^a-z] -- fail with strings starting with a-z and otherwise advance
   by one code point
   2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
   starting with a-z, ch, or rr, and otherwise advance by one code point
   3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z,
   ch, or rr, and otherwise advance by one grapheme cluster
   4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
   strings starting with a-z, ch, or rr, and otherwise advance by one tailored
   grapheme cluster (for traditional Spanish, the tailoring would include ch,
   ll, rr, and thus allow "ll" as a match).
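
To make the difference between #1 and #2 concrete, here is a minimal Python
sketch. It uses a hypothetical set, [a-c \q{xy}], rather than the example
above: in [a-z \q{ch} \q{rr}] both strings begin with a letter that is
already in a-z, so #1 and #2 would reject the same inputs there. The names
OPTION_1, OPTION_2 and first_match are made up for this sketch, and "." with
re.DOTALL stands in for [\x{0}-\x{10FFFF}]. Options #3 and #4 are not shown
because Python's re module has no \X.

```python
import re

# Hypothetical set [a-c \q{xy}], unrolled as ( [a-c] | xy ).

# Option 1: plain negated class; the string "xy" is simply ignored.
OPTION_1 = re.compile(r"[^a-c]")

# Option 2: negative lookahead over the unrolled set, then one code point.
# re.DOTALL lets "." match any code point, including newline.
OPTION_2 = re.compile(r"(?![a-c]|xy).", re.DOTALL)

def first_match(pattern, text):
    """Return what pattern matches at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None

if __name__ == "__main__":
    for sample in ["b", "xy", "xz"]:
        print(sample, first_match(OPTION_1, sample), first_match(OPTION_2, sample))
```

With this set, both options fail on "b" and both advance over "x" in "xz".
They differ only on "xy": option 1 advances over "x", while option 2 fails
because the input starts with a string in the set.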

Comments:
#1 is the current approach.
#2 seems more intuitive, and is worth proposing as an alternative.
#3 would only be compatible in a mode where [^a-z] also matched at a
grapheme level. So it couldn't be the default. In such a mode, however, it
would be the natural extension of #2.
#4 is a possibility, but only for locale-sensitive regexes (which are
uncommon). In such a mode, it would be the natural extension of #3.

The proposal is to have #2 as the new recommended default approach in a
proposed update to UTS #18, for public review and comment. #3 and #4 could
also be included for the appropriate modes, although they are less
important.

-- 
Mark


This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 19:35:37 CST