RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 17:51:23 CDT

    Hans Aberg [mailto:haberg@math.su.se]
    > This is not the language complement. The language set operators are
    > just the normal set operators inside the set C*, where C is the
    > character set. See for example, Aho, Sethi & Ullman, "Compilers" (the
    > "dragon book") on REs.

    I studied this book 18 years ago, and I still have it. Implementing
    compilers was a significant part of my student projects, and I have used it
    at work many times since then to implement various compiler-like tools (not
    just regexps).

    I don't know why you want to restrict negation to only a small part of
    what it can do.

    My negation operator (like all other regexp operations) IS operating as an
    injection from a set of positioned strings (C*, P) to (C*, P), where P is a
    set of positions within strings of C*. It is more general, and it covers the
    particular case of injections from C* to C* (where P is then only a
    singleton determined by a particular rule).

    Note: my engine does not look first for the longest match. Longest matches
    are found by filtering, using customizable rules, after each match (whatever
    its size or position) is found. Why?

    Consider an input text coming from a stream A that generates an infinite
    sequence of characters "a".
    Feed this stream to a filter that returns the longest match for the
    regexp "a*". This filter will never return, unless it applies the
    optimization technique of converting a final loop into a final recursion.
    Now use the same filter but with the regexp "a*b": this time it is
    impossible for the filter to return anything.
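    A minimal sketch of this point in Python (the hand-built matcher below is
    mine, not the author's engine): a shortest-first enumeration of the matches
    of "a*" anchored at position 0 stays productive even on an infinite stream,
    whereas waiting for the longest match would never terminate.

    ```python
    import itertools

    def prefixes_matching_a_star(stream):
        """Yield every match of the regexp a* anchored at position 0,
        shortest first. Each match is emitted as soon as it is confirmed,
        so this works even on an infinite stream."""
        buf = ""
        yield buf               # a* matches the empty string
        for ch in stream:
            if ch != "a":
                return          # no longer matches continue past this point
            buf += ch
            yield buf

    # An infinite stream of "a": a longest-match strategy would never return,
    # but shortest-first produces each match after finitely many steps.
    stream = itertools.repeat("a")
    first_three = list(itertools.islice(prefixes_matching_a_star(stream), 3))
    # first_three == ["", "a", "aa"]
    ```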

    The "smallest leftmost match first" rule guarantees a return after a finite
    number of steps (which does not depend on the input size), using bounded
    resources (memory, time, number of state variables), even in the last two
    cases above. The resource bounds are also fully predictable from the regexp
    alone, and they do not grow combinatorially. On finite input streams (such
    as text files), this strategy enumerates EXACTLY the same matches, just in
    a different order.

    And one can still enumerate only the longest matches by sequentially
    filtering the output, using a single prefetch variable to hold the candidate
    match returned by the previous strategy. In addition, this process is
    easily parallelizable for finding matches in very long texts, using as many
    processor cores as needed, each one using a small amount of local-only data,
    without synchronization between the running threads.
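    A sketch of that filtering step, assuming matches arrive as (start, text)
    pairs in shortest-leftmost order (the function name is mine): a single held
    candidate suffices, because a later match at the same start position is
    always longer and simply replaces it.

    ```python
    def longest_only(matches):
        """Reduce matches produced in shortest-leftmost order to the longest
        match at each start position, using one prefetch variable."""
        candidate = None  # the single held candidate
        for start, text in matches:
            if candidate is not None and candidate[0] != start:
                yield candidate          # start advanced: flush previous best
            candidate = (start, text)    # same start => strictly longer match
        if candidate is not None:
            yield candidate

    # Shortest-leftmost output for some input, filtered to longest matches:
    ms = [(0, ""), (0, "a"), (0, "aa"), (3, "b"), (5, ""), (5, "c")]
    result = list(longest_only(ms))
    # result == [(0, "aa"), (3, "b"), (5, "c")]
    ```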

    Given these advantages (plus the bonus that bounded resources offer for
    system security), I'm sure my strategy is correct, as it takes nothing away
    from the classic regexps, which are still supported (including POSIX
    ed/sed/vi, POSIX lex, and the proposed Unicode regexps).

    I am even starting to modify my code to also support named subregexps,
    which allow matching paired parentheses, for example (impossible with
    traditional regexps, which can't "count" them):

    (?=paren= (a | [(]\=paren=[)] | [[]\=paren=[]] | [{]\=paren=[}])+ )

    Or matching equal substrings (like "a+a" or "xyz+xyz" but not "a+b"):

    (?=samename= [a-z]+ ) + \=samename=
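    The equal-substrings case corresponds to named backreferences, which
    Python's standard re module already supports under a different notation
    ((?P<name>...) to capture, (?P=name) to match the same text again); this is
    only an analogy for the author's \=name= syntax, not his engine:

    ```python
    import re

    # (?P<samename>[a-z]+) captures a run of letters; (?P=samename) must
    # then match the identical text, as in the "a+a" / "xyz+xyz" example.
    pat = re.compile(r"(?P<samename>[a-z]+)\+(?P=samename)")

    results = [bool(pat.fullmatch(s)) for s in ("a+a", "xyz+xyz", "a+b")]
    # results == [True, True, False]
    ```

    Note that backreferences alone cannot express the recursive "paren"
    definition; that genuinely needs the stack discussed below.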

    I'm not sure yet how to encode the syntax for these operators; I'm looking
    for other references... It will require adding a stack to the engine state,
    to handle non-initial and non-trailing recursive regexps (like the "paren"
    definition above); non-recursive named regexps, as well as left-recursive
    or right-recursive regexps, can be rewritten using normal counters like
    "+" and "*".
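    A small Python check of the rewrite claimed in the last sentence (helper
    names are mine): the right-recursive definition X -> a X | a generates
    exactly the strings a, aa, aaa, ..., so it is equivalent to the plain
    counter "a+" and needs no stack.

    ```python
    import re

    def gen_right_recursive(depth):
        """Expand X -> a X 'depth' times, then close with X -> a."""
        return "a" * (depth + 1)

    a_plus = re.compile(r"a+")
    ok = all(a_plus.fullmatch(gen_right_recursive(d)) for d in range(5))
    # ok == True: every string the recursion generates is matched by a+
    ```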



    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 17:52:46 CDT