RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 22:22:35 CST

Next message: Dmitry Turin: "Re[3]: Hardwarily formating colour and size of font (3 new symbols)"

Previous message: Philippe Verdy: "RE: Fish (was Re: Marks)"
In reply to: Mark Davis: "Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Mike: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

And my approach is to leave the freedom between the #2, #3 and #4 solutions
(I support advancing one tailored element, the set of unitary elements
depending on the input universe being considered,which may be
locale-sensitive). But I also support #1 in the classic C locale (more
exactly, this is not necessarily one code point in #1 and #2, it may be one
code unit as well, as this depends also of the input universe is what “.”
can match (ignoring here the special case of line terminators whose effect
on the input often depends on an option flag).

But for any Regexps that supports some “\q{}” grouping syntax at least
within sets, #1 is certainly the worst option.

The problem with #2 is that when advancing one code point, it may cross a
cluster boundary implied in negated sets like [^\N{C WITH CEDILLA AND
ACCENT}] where the character referenced by name has several canonical
equivalents with variable lengths (in terms of number of code points matched
which varies here from 1 to 3, with 5 possible encodings, and possibly more
if there are additional combining characters after these sequences, not
necessarily paired or ordered in the canonical order).

This leaves another difficulty for the actual meaning of an expression like
[\x0-\x10FFFF] which includes many characters with multiple decompositions:
if more than one code point can be matched by a regexp like [é] which
contains a canonically decomposable character, then

* are all the possible decompositions included in the set in addition
to single code points in the range?
* How to interpret [à-é] if it depends on the normalization form
(let’s ignore here the question of locales which affects the ordering)?
* what is the meaning of ranges like this in terms of relative
ordering and what the minimum and maximum bounds are including or excluding,
and how to order \q{rr} itself if it’s one of the range bounds?

Many efforts have been invested by Unicode to define things like collation
and case-mappings, but all this effort is currently left unused in the
current UTS; this means that it matches no correct locale, but only a very
basic locale (not even the Basic default Unicode locale as it does not seam
to consider at least the DUCET). If you want to work only at the codepoint
level, then the exposed Regexp will work reliably only in the basic “C”
locale (or its minor “POSIX” variant), so all other character properties
should be used according to that basic locale (excluding then advanced case
mappings, and all multilevel collation orders and the DUCET).

So I hope that the proposal, in the current state, will not discard/forbid
further advanced features using a more complete set of Unicode properties
and the support of locales, and that space will be left for at least
experimentation without violating the current rules exposed, in away that
would make the existing Regexps incompatible (in their interpretation) with
the future evolutions.

_____

De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De la
part de Mark Davis
Envoyé : mardi 2 octobre 2007 03:32
À : Mike; Andy Heninger
Cc : Unicode
Objet : Proposal for matching negated sets (was Re: New Public Review Issue:
Proposed Update UTS #18)

Getting back to this topic, given the upcoming meeting we should prepare
some kind of summary and submit it. (Nothing on this list is seen by the UTC
unless someone writes up a proposal or submits feedback.)

Here is my attempt (grabbing some text from an email of Sep 23). Please let
me know if you have any feedback before I send it in.

There are a few viable approaches to matching negated sets. Let's take [a-z
\q{ch} \q{rr}] as an example. It is productive to look at the "unrolled
version of this, using the fact that the following are equivalent (ignoring
capturing for the moment):

[a-z \q{ch} \q{rr}]
and
( [a-z] | ch | rr )

Then the question amounts to what the 'inverse' of ( [a-z] | ch | rr ) is
supposed to be equivalent to. Here are some possibilities:

1. [^a-z] -- fail with strings starting with a-z and otherwise advance
by one code point
2. (?! [a-z] | ch | rr ) [\x{0}-\x{10FFFF}] -- fail with strings
starting with a-z, ch, or rr, and otherwise advance by one code point
3. (?! [a-z] | ch | rr ) \X -- fail with strings starting with a-z, ch,
or rr, and otherwise advance by grapheme cluster
4. (?! [a-z] | ch | rr ) \X -- but with tailored \X -- fail with
strings starting with a-z, ch, or rr, and otherwise advance by one tailored
grapheme cluster (for traditional spanish, would include ch, ll, rr, and