RE: FYI: Regex paper for UTC

From: Philippe Verdy (
Date: Sat Oct 13 2007 - 18:24:36 CDT

  • Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"

    > -----Message d'origine-----
    > De : Hans Aberg []
    > Envoyé : dimanche 14 octobre 2007 00:49
    > À :
    > Cc : 'Mark Davis'; 'Unicode'
    > Objet : Re: FYI: Regex paper for UTC
    > On 14 Oct 2007, at 00:05, Philippe Verdy wrote:
    > > All my regexps are first normalized to alternations and negations ONLY
    > > (instead of putting the negation at the beginning of the regexp
    > > ONLY and using multiple binary set operators "|", "&", "\" like
    > > suggested in the recent paper submitted to the UTC).
    > The definition in terms of a language L(x) has the advantage of
    > avoiding that problem, and being theoretically clearer exactly as to
    > what is being defined.

    Possible, but I won't support it when a match() method is running against
    text. The compile() method will rewrite the regexps very simply using
    negations and alternations. And I won't forbid using negation directly, so
    writing "a -- b", or "a && ~b", or even "~b && a" will have exactly
    equivalent behaviours and exactly the same performance... (supporting "&&"
    or "--" will just be a facility).

    For the rewrite rules, I just need 3 of them, not 10:

    A -- B ==> A || ~B
    A && B ==> ~(~A || ~B)
    ~~A ==> A

    (parentheses are already eliminated by parsing, the rewrites are actually
    not performed on the regexp input string, but on the transition graph
    generated directly from the initial parsing).

    In fact I support other rewriting rules like:
    [ghiabcdej] ==> (a || b || c || d || e || g || h || i || j)
                ==> ((a \- e) || (g \- j))
    to reorder and group classes heuristically (I can match ranges directly
    instead of matching single characters, like in the following syntax:
    [\q{00}-\q{99}] ==> 00 \- 99

    And I support collation order in tweakable various locales to give meaning
    to the previous notation, using ranges of strings with bounded length like
    above or unbounded length like below (where I use collation in the C locale
    tailored by numeric order, the value being interpreted in that locale):

    (?{L=C.number} [5-\q{[0-9]+}] ) ==> (?{L=C.number} 5 \- (0 \- 9)+ )

    Or simply "[5-\q{[0-9]+}]" using an external locale which may or may not be
    tailoring the collation order.

    The syntax will be fixed later after more tests about what can be supported
    predictably. For now I support custom returns to validate partial matches
    using filtered output on captures:


    (which is much simpler, and will probably be faster, even though I can't
    specify the condition of transition within the regexp itself)

    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 18:26:27 CDT