RE: FYI: Regex paper for UTC

From: Philippe Verdy ([email protected])
Date: Sat Oct 13 2007 - 18:24:36 CDT

Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"

Previous message: Doug Ewell: "Re: New FAQ page"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

> -----Message d'origine-----
> De : Hans Aberg [mailto:[email protected]]
> Envoyé : dimanche 14 octobre 2007 00:49
> À : [email protected]
> Cc : 'Mark Davis'; 'Unicode'
> Objet : Re: FYI: Regex paper for UTC
>
> On 14 Oct 2007, at 00:05, Philippe Verdy wrote:
>
> > All my regexps are first normalized to alternations and negations ONLY
> > (instead of putting the negation at the beginning of the regexp
> > ONLY and using multiple binary set operators "|", "&", "\" like
> > suggested in the recent paper submitted to the UTC).
>
> The definition in terms of a language L(x) has the advantage of
> avoiding that problem, and being theoretically clearer exactly as to
> what is being defined.

Possible, but I won't support it when a match() method is running against
text. The compile() method will rewrite the regexps very simply using
negations and alternations. And I won't forbid using negation directly, so
writing "a -- b", or "a && ~b", or even "~b && a" will have exactly
equivalent behaviours and exactly the same performance... (supporting "&&"
or "--" will just be a facility).

For the rewrite rules, I just need 3 of them, not 10:

A -- B ==> A || ~B
A && B ==> ~(~A || ~B)
~~A ==> A

(parentheses are already eliminated by parsing, the rewrites are actually
not performed on the regexp input string, but on the transition graph
generated directly from the initial parsing).

In fact I support other rewriting rules like:
[ghiabcdej] ==> (a || b || c || d || e || g || h || i || j)
==> ((a \- e) || (g \- j))
to reorder and group classes heuristically (I can match ranges directly
instead of matching single characters, like in the following syntax:
[\q{00}-\q{99}] ==> 00 \- 99

And I support collation order in tweakable various locales to give meaning
to the previous notation, using ranges of strings with bounded length like
above or unbounded length like below (where I use collation in the C locale
tailored by numeric order, the value being interpreted in that locale):

(?{L=C.number} [5-\q{[0-9]+}] ) ==> (?{L=C.number} 5 \- (0 \- 9)+ )

Or simply "[5-\q{[0-9]+}]" using an external locale which may or may not be
tailoring the collation order.

The syntax will be fixed later after more tests about what can be supported
predictably. For now I support custom returns to validate partial matches
using filtered output on captures:

([0-9]+){?}

(which is much simpler, and will probably be faster, even though I can't
specify the condition of transition within the regexp itself)

Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"
Previous message: Doug Ewell: "Re: New FAQ page"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 18:26:27 CDT