RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 21:14:52 CST

  • Next message: Philippe Verdy: "RE: Fish (was Re: Marks)"

    Mike wrote:
    > >> All that matters is
    > >> that you find the longest match. [a-z\q{ch}] will match "ch"
    > >> in "chinchilla" rather than just "c".
    > >
    > > And what can you do with the negated class? How do you define it
    > > consistently?
    >
    > A negated class matches wherever the non-negated class doesn't
    > match and vice-versa. Haven't I said this numerous times?

    Yes. And I have said it too; you can reread my messages, I have not denied
    it! That's exactly why we need special behavior for collation elements
    when they are part of a negated class or a negated union.
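
    To make the rule quoted above concrete, here is a minimal sketch (not
    Mike's implementation, only an illustration with the class [a-z\q{ch}]
    hard-coded) of longest-match against a class containing the contraction
    "ch", and of its negation defined as matching wherever the positive class
    does not:

        import java.util.Arrays;
        import java.util.List;

        public class NegatedClassSketch {
            // Hypothetical class [a-z\q{ch}]: the letters a-z plus the
            // multi-character collation element "ch".
            static final List<String> CONTRACTIONS = Arrays.asList("ch");

            // Length matched by the class at position pos, or 0 if no match;
            // contractions are tried first so that the longest match wins.
            static int matchClass(String text, int pos) {
                for (String e : CONTRACTIONS) {
                    if (text.startsWith(e, pos)) return e.length();
                }
                char c = text.charAt(pos);
                return (c >= 'a' && c <= 'z') ? 1 : 0;
            }

            // The negated class matches exactly where the positive class does not.
            static boolean matchNegatedClass(String text, int pos) {
                return matchClass(text, pos) == 0;
            }

            public static void main(String[] args) {
                System.out.println(matchClass("chinchilla", 0));        // 2: "ch", not just "c"
                System.out.println(matchNegatedClass("chinchilla", 0)); // false
                System.out.println(matchNegatedClass("42", 0));         // true
            }
        }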

    I personally make no distinction between classes written with [] and
    unions written with (...|...|...). I already have a regexp matcher that
    works exactly this way. Although it does not yet recognize the special
    classes defined by Unicode character property values, nothing prevents me
    from adding that. My regexp parser compiles the regexp into a simple
    vector of 32-bit ints containing either single codepoints or a primitive
    (coded as a negative value, without any bitset, because it offers far
    better performance). There are only four operator primitives, for encoding
    unions, disjunctions, intersections and negations. There's a special
    primitive value for encoding transitions over ranges of characters, and
    I'll add another for encoding tests of character properties (which define
    the special classes).
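
    As a purely hypothetical sketch of what such an encoding can look like
    (the constants and the layout below are invented for illustration, not
    the exact layout of my implementation), the compiled program can be a
    flat int[] in which non-negative entries are code points and negative
    entries are operator primitives:

        // Sketch of the compiled form: a flat vector of 32-bit ints.
        // Non-negative entries are code points; negative entries are primitives.
        public class CompiledRegexpSketch {
            // The four operator primitives plus a range primitive.
            static final int OP_UNION        = -1;
            static final int OP_DISJUNCTION  = -2;
            static final int OP_INTERSECTION = -3;
            static final int OP_NEGATION     = -4;
            static final int OP_RANGE        = -5;  // followed by two code points: low, high

            // The class [a-cx] compiled by hand into this encoding, for illustration.
            static final int[] PROGRAM = {
                OP_UNION,
                OP_RANGE, 'a', 'c',
                'x',
            };

            // Test whether a single code point is matched by PROGRAM above.
            static boolean matches(int cp) {
                for (int i = 1; i < PROGRAM.length; i++) {
                    if (PROGRAM[i] == OP_RANGE) {
                        if (cp >= PROGRAM[i + 1] && cp <= PROGRAM[i + 2]) return true;
                        i += 2;
                    } else if (PROGRAM[i] >= 0 && PROGRAM[i] == cp) {
                        return true;
                    }
                }
                return false;
            }

            public static void main(String[] args) {
                System.out.println(matches('b'));  // true
                System.out.println(matches('x'));  // true
                System.out.println(matches('z'));  // false
            }
        }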

    The compiled vector has a size in O(n), where n is the length of the
    source regexp, with a constant factor of at worst 4. With this very
    reduced set of primitives, there are no complex switches and tests for
    transitions. The whole regexp parser is fully driven by tables. I am also
    currently adapting the parser so that it will compile several regexp
    syntaxes (this will also be table-driven, so that the number of supported
    syntaxes is easily extensible). It supports compilation flags too (for
    compatibility with ed/sed/vi or WinWord flags, or Emacs flags, or PCRE and
    PHP flags...). I also want to extend the parser for the flags string so
    that it supports the table-driven syntaxes as well.
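
    By "table-driven" I mean something along the lines of the following
    simplified sketch (an invented example, not my actual tables): each
    regexp dialect is described by a table of metacharacters, so adding a
    dialect means adding a table rather than adding code to the tokenizer.

        import java.util.HashMap;
        import java.util.Map;

        public class SyntaxTables {
            enum Tok { LITERAL, ALTERNATION, GROUP_OPEN, GROUP_CLOSE, CLASS_OPEN, CLASS_CLOSE }

            // PCRE-like dialect: ( ) | are metacharacters.
            static Map<Character, Tok> pcreLike() {
                Map<Character, Tok> t = new HashMap<>();
                t.put('|', Tok.ALTERNATION);
                t.put('(', Tok.GROUP_OPEN);
                t.put(')', Tok.GROUP_CLOSE);
                t.put('[', Tok.CLASS_OPEN);
                t.put(']', Tok.CLASS_CLOSE);
                return t;
            }

            // Basic (ed/sed-style) dialect: plain ( ) are literals.
            static Map<Character, Tok> basicLike() {
                Map<Character, Tok> t = new HashMap<>();
                t.put('[', Tok.CLASS_OPEN);
                t.put(']', Tok.CLASS_CLOSE);
                return t;
            }

            // The tokenizer consults the selected table instead of a hard-coded switch.
            static Tok classify(Map<Character, Tok> table, char c) {
                return table.getOrDefault(c, Tok.LITERAL);
            }

            public static void main(String[] args) {
                System.out.println(classify(pcreLike(), '('));   // GROUP_OPEN
                System.out.println(classify(basicLike(), '('));  // LITERAL
            }
        }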

    Note that it supports "restartable" matches, including backward searches
    (something really difficult with regexps), and also matching under
    user-driven conditions that are not stored in the regexp itself but
    evaluated through callbacks, such as computing the number of matches
    according to other fields already matched. This uses a special parameter
    value for one of the four operators, namely the disjunction operator
    primitive; it is exactly this feature that will allow me to evaluate
    Unicode character properties and reply whether a transition is or is not
    a match. This operator keeps both branches of a disjunction as valid
    states until the matcher reaches a final state in the transition graph.
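
    The callback idea, reduced to its simplest possible form (a hedged sketch
    with invented names, not the actual hook in my matcher): the decision for
    a transition is delegated to a predicate supplied from outside the
    compiled regexp, which can test a Unicode character property or a
    condition depending on fields already matched.

        import java.util.function.IntPredicate;

        public class CallbackTransitionSketch {
            // A transition whose acceptance is decided by a callback receiving
            // the code point at the current position.
            static boolean step(String text, int pos, IntPredicate condition) {
                return pos < text.length() && condition.test(text.codePointAt(pos));
            }

            public static void main(String[] args) {
                // A character-property test supplied from outside the compiled regexp:
                IntPredicate isLetter = Character::isLetter;
                System.out.println(step("été", 0, isLetter));   // true
                System.out.println(step("42", 0, isLetter));    // false

                // A user-driven condition, e.g. depending on a previously matched field
                // (illustrative only):
                String previousField = "abc";
                IntPredicate dependsOnPrevious = cp -> previousField.length() == 3;
                System.out.println(step("xyz", 0, dependsOnPrevious));  // true
            }
        }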

    My code has no real equivalent among other implementations. I have
    generalized all the concepts so that it can more easily optimize the
    speed of transitions by rearranging the various character classes, and so
    that it supports multiple locales. For now it cannot compile complete
    collation tailoring tables provided externally, but it supports the
    4-level DUCET and the single-level binary collation order. Maybe I'll
    integrate support for the syntax of external collation tailoring rules,
    or I'll use the ICU implementation (this is not decided; there may be
    performance or design issues in handling the many internal state
    variables in the matcher without having to make very complex decisions).
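
    To illustrate the difference between those two orders (using the stock
    JDK Collator here, not my engine): single-level binary order compares raw
    code points, while a multilevel UCA-derived order compares weights.

        import java.text.Collator;
        import java.util.Locale;

        public class OrderComparison {
            public static void main(String[] args) {
                String a = "a", b = "B";
                // Binary order of code points: 'B' (U+0042) sorts before 'a' (U+0061).
                System.out.println(a.compareTo(b) > 0);      // true

                // Collation order: primary weights ignore case, so "a" sorts before "B".
                Collator coll = Collator.getInstance(Locale.ROOT);
                coll.setStrength(Collator.PRIMARY);
                System.out.println(coll.compare(a, b) < 0);  // true
            }
        }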

    > > For what you are defining, you are not creating the necessary support
    > > for correct support of locales, i.e. you are restricting yourself only
    > > to the C locale, whose collation is defined strictly in binary order of
    > > code points and nothing else (you are working in the "C" locale only).
    >
    > What I'm saying is that I think ranges of characters in character
    > classes denoted by '[' and ']' should be limited to binary order.

    Why "should"? This will be a good option only for the "C" locale, if you
    want to ignore all searches taking into account the linguistic semantics and
    script properties. My intent is to be able to perform searches including in
    complex contexts, like in CSV tables where there are columns containing some
    data containing numeric values, date values, or text values used as
    selectors, and combining them to look for matching rows, or positions in
    tables that match some regexp plus some specific locales. This requires of
    course extensions to the syntax, but the model for doing that with a regexp
    is ready, and I've already implemented each examplar feature (need to be
    completed for similar features).
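
    The kind of combined condition I have in mind looks roughly like the
    following sketch (the column layout, the selector regexp and the numeric
    threshold are all invented here; my extended syntax would express this
    inside the regexp itself rather than in surrounding code):

        import java.util.regex.Pattern;

        public class CsvRowFilter {
            public static void main(String[] args) {
                String[] rows = { "alpha;12;2007-10-01", "beta;250;2007-09-30", "gamma;7;2007-10-02" };
                Pattern namePattern = Pattern.compile("^(a|g).*");  // text selector column
                for (String row : rows) {
                    String[] cols = row.split(";");
                    boolean nameOk  = namePattern.matcher(cols[0]).matches();
                    boolean valueOk = Integer.parseInt(cols[1]) >= 10;  // numeric condition
                    if (nameOk && valueOk) {
                        System.out.println(row);  // prints only the "alpha" row
                    }
                }
            }
        }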

    > If you want to define a character class that works according to a
    > locale, I'm suggesting that you want different syntax for that to
    > avoid confusion and unexpected behavior. My previous example of
    > Hawaiian [a-z] being equivalent to [aeiouhklmnpw] illustrates the
    > problem.

    Well, my regexp engine will support several syntaxes, including the basic
    one that you will document in the UTS... What I am suggesting is that you
    limit the use of the terms "SHOULD" or "MUST", so that the UTS does not
    forbid locale-sensitive searches using the same operators.

    > Obviously I don't have this restriction; this whole thread of
    > discussion started because I found a problem with the way the
    > UTS suggested an implementation of a negated character class
    > containing collation elements would work.

    And I see that we have different views on how to handle this problem in a
    consistent way. I prefer the approach that gives meaning to the most
    generic cases instead of focusing on a specific workaround.

    > > * You won't recognize any Unicode canonical equivalence in regexps. (But
    > > then why are you recognizing them in scanned texts? This is
    > > inconsistent.)
    >
    > You are assuming details about my implementation that aren't true.
    > Right now both the regular expression and scanned text are internally
    > converted to NFD to handle canonical equivalents. The exception is
    > inside character classes. Again I'll use Hangul for the example:

    OK. But I'm not enforcing this on regexps. I just want to make sure that
    all canonically equivalent strings representing regexps (in their own
    syntax) return exactly the same set of matches from any texts that are
    canonically equivalent to each other.
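
    In other words, something like the following should hold (a minimal
    sketch using the JDK Normalizer and Pattern classes, not my engine; here
    the property is obtained simply by normalizing both the pattern and the
    text to NFD before matching):

        import java.text.Normalizer;
        import java.util.regex.Pattern;

        public class EquivalentRegexps {
            static boolean find(String regexp, String text) {
                String p = Normalizer.normalize(regexp, Normalizer.Form.NFD);
                String t = Normalizer.normalize(text, Normalizer.Form.NFD);
                return Pattern.compile(p).matcher(t).find();
            }

            public static void main(String[] args) {
                String composed   = "caf\u00E9";    // é as U+00E9
                String decomposed = "cafe\u0301";   // e + U+0301 COMBINING ACUTE ACCENT
                String text = "un caf\u00E9 noir";
                // Both canonically equivalent regexps find the same match:
                System.out.println(find(composed, text));    // true
                System.out.println(find(decomposed, text));  // true
            }
        }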

    > Suppose you have the character class [\u1100-\u1110\u1161-\u1168],
    > but are able to enter the actual code points instead of using \u.
    > It represents a range of L code points, together with some V code
    > points. You are saying that the adjacent U+1110 and U+1161 should
    > combine together into a precomposed LV syllable, but that completely
    > messes up the meaning of the character class (making it nonsensical).
    >
    > > * You won't be able to recognize case-mappings consistently (for case
    > > insensitive searches), because collation elements will all be distinct
    > > with only primary differences, and no further levels.
    >
    > For case-insensitivity, I compile the regular expression differently
    > and also perform NFD->CaseFold->NFD on the text. I still don't
    > understand why you need to think in terms of collation....

    That's because case folding is not enough for locale-sensitive searches.
    We also need full case mapping, plus special case mappings including
    contractions or expansions. These can be handled in a mixed collation
    table, at the case level (because I also need the multilevel capabilities
    of the UCA).
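
    A minimal sketch of why the multiple levels matter, assuming ICU4J is
    available (this only uses a stock collator, not my mixed table): at
    primary strength the German sharp s is expected to compare equal to "ss",
    while the finer levels still keep them distinct for ordering purposes.

        import com.ibm.icu.text.Collator;
        import java.util.Locale;

        public class CaseLevelSketch {
            public static void main(String[] args) {
                Collator german = Collator.getInstance(Locale.GERMAN);
                german.setStrength(Collator.PRIMARY);
                System.out.println(german.compare("stra\u00DFe", "strasse") == 0);  // expected: true
                german.setStrength(Collator.TERTIARY);
                System.out.println(german.compare("stra\u00DFe", "strasse") == 0);  // expected: false
            }
        }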

    > > * Even if you restrict to only the set of primary differences, the only
    > > case-mappings you will be able to recognize are the simple one-to-one
    > > case mappings defined in the main UCD file, excluding special mappings
    > > (like the consistent distinctions of dotted and undotted Turkic "I"... or
    > > finding matches containing "SS" when a German sharp s is specified in the
    > > search string... or allowing matches for final variants of Greek letters)
    >
    > I don't see why these cases wouldn't work if they are supported by
    > CaseFolding.txt. I have tried matching SHARP S with "ss" and it
    > works.

    It breaks the ordering. I want to be able, for example, to perform
    searches like "give me all matches where this field matched by a
    subregexp has a value in a range specified by two strings", not only
    searches using simple ranges in classes. The collation will also support
    matches by numeric value (advanced collation). This is another reason why
    I don't make any distinction, at matching time, between classes and
    unions (they support exactly the same features, including aggregates; a
    class is just another convenient abbreviated syntax for the same thing):
    I've generalized the concept by redefining classes to be unsorted sets of
    matches for any regexp, instead of just unsorted sets of matches for only
    one character. A whole regexp can now be viewed as a class, given that
    all regexps are unions of matches using some exclusive or non-exclusive
    conditions.
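
    For the "range specified by two strings" condition, the test I have in
    mind is essentially the following (a hedged sketch assuming ICU4J; the
    helper name and the sample bounds are invented):

        import com.ibm.icu.text.Collator;
        import com.ibm.icu.text.RuleBasedCollator;
        import java.util.Locale;

        public class CollationRangeSketch {
            // True if value falls between low and high under the given collation order.
            static boolean inRange(Collator coll, String value, String low, String high) {
                return coll.compare(low, value) <= 0 && coll.compare(value, high) <= 0;
            }

            public static void main(String[] args) {
                RuleBasedCollator coll = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
                System.out.println(inRange(coll, "cherry", "apple", "date"));  // true

                // Numeric collation: "9" sorts before "10", unlike plain string order.
                coll.setNumericCollation(true);
                System.out.println(inRange(coll, "9", "1", "10"));             // true
            }
        }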

    > This is tedious re-explaining things. My regex matcher is a
    > conforming process; it handles input text appropriately regardless
    > of normalization. Who ever said that the regular expression itself
    > needs to be invariant w.r.t. normalization? I have shown an example
    > of a character class that changes meaning if you convert it to NFC
    > (see above).

    And that's a caveat that I want to be able to avoid: I want full
    conformance for the regexp process, not just for the searched text inputs
    but also for the regexp inputs.

    > > You could also disable finding the canonical equivalences by using
    > > another flag, but then you must do it consistently, by disabling it BOTH
    > > in the input texts AND in the input regexp, but NOT only in texts like
    > > what you have done and you propose.
    >
    > I didn't propose that. (If I did, it was a mis-statement.)

    Maybe I misunderstood your statement, then. You have just said that
    converting a regexp to NFC will change the set of matches, so this is not
    conforming...

    > There is a minimum amount of complexity you need to deal with, and
    > I chose to use normalization instead of trying to figure out the
    > complete list of possible canonical equivalents (which if you have
    > n combining characters can explode as n!).

    I know that. My implementation avoids normalization steps in most cases,
    except when explicitly requested by using transition rules that depend on
    collation elements. This gives much more flexibility.

    > > It's not up to regexps to make normalizations, but it's up to regexps to
    > be
    > > able to recognize classes of canonically equivalent texts and find
    > identical
    > > sets of matches in this case if they want to be Unicode compliant
    > processes.
    >
    > I'm worn out....

    Well, it's true that the subject is quite complex to explain in clear
    terms. If you look at the documentation of PCRE regexps, it is most often
    less clear than experimenting with the various cases yourself to
    understand what is meant.

    Notably so when looking at restartable conditions, or at the rules for
    matching the leftmost-longest sub-regexps and controlling how the fields
    are matched and paired in regexps like /(x+)(x*)/: how many "x" will you
    match in the first field, and how many in the second, when you use the
    longest-matching rule?

    Disambiguating such things requires some tricky syntax, or tricky global
    flags submitted to the regexp compiler. Such complications usually do not
    occur if you just want to find a list of full regexp matches, but they do
    occur when you want to support substitution with subfields within the
    match.
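
    For reference, this is the answer a conventional greedy (PCRE-like)
    backtracking engine gives for that example: the first group takes
    everything it can and the second group is left empty.

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class GreedyFields {
            public static void main(String[] args) {
                Matcher m = Pattern.compile("(x+)(x*)").matcher("xxxx");
                if (m.matches()) {
                    System.out.println("group 1 = \"" + m.group(1) + "\"");  // "xxxx"
                    System.out.println("group 2 = \"" + m.group(2) + "\"");  // ""
                }
            }
        }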

    There are also tricky cases in regexps like /(a*b*c*)(\1)/ where you want
    the two delimited fields to be equal in a match. This becomes even
    trickier when you want to look for two fields that have related but
    distinct values matching some other mutual condition (for this you need
    restartable matchers and backward capabilities, and this changes the way
    you represent the graph of transitions).
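
    With a stock backtracking engine, the equal-fields case already works
    through the backreference, as this small illustration shows; it is the
    "related but distinct values" case that has no standard syntax:

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class BackreferenceFields {
            public static void main(String[] args) {
                Pattern p = Pattern.compile("(a*b*c*)(\\1)");
                Matcher m = p.matcher("abcabc");
                System.out.println(m.matches());                       // true
                System.out.println(m.group(1) + " / " + m.group(2));   // abc / abc
                System.out.println(p.matcher("abcab").matches());      // false: fields unequal
            }
        }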

    (Note: because of the complexity of the regexps my model supports, I have
    included the possibility of adding ignorable comments and layout spaces
    to make some parts readable, but most of the complexity will be hidden by
    the fact that I support separate definitions for sub-regexps, which are
    "imported" or included in the final regexp, as if it were a program. This
    is enabled either by a global compilation flag or by a special grouping
    operator anywhere in the regexp itself: these comment parts are discarded
    when the regexps are parsed, and don't increase the O(n) size of the
    basic vector representing the compiled regexp as a microprogram.)
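
    The closest existing analogue to those ignorable comments and layout
    spaces is the free-spacing mode of PCRE-style engines (the /x flag,
    exposed as Pattern.COMMENTS in Java), shown here only for comparison:
    whitespace and #-comments are discarded at parse time and do not change
    what the regexp matches.

        import java.util.regex.Pattern;

        public class ReadableRegexp {
            public static void main(String[] args) {
                String readable =
                    "(\\d{4})   # year  \n" +
                    "-(\\d{2})  # month \n" +
                    "-(\\d{2})  # day   \n";
                Pattern p = Pattern.compile(readable, Pattern.COMMENTS);
                System.out.println(p.matcher("2007-10-01").matches());  // true
            }
        }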



    This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 21:19:05 CST