RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (
Date: Fri Oct 05 2007 - 22:09:24 CDT

  • Next message: N. Ganesan: "Blackberry"

    Andy Heninger wrote:
    > On 10/4/07, Mike <> wrote:
    > > Should the set [^xyz\q{ch}]match the 'c' in "ch"?
    > I don't think so; since the \q{ch} matches "ch", the negated set does
    > not match at the first position.
    > The choices you have made seem reasonable to me.
    > But what would implementations with a DFA (non-backtracking)
    > implementation do? It would be very difficult for them to
    > not take a shorter string > from a set if that led to an overall
    > longer match. Would it be OK - still useful- if the UTS left
    > what happens unspecified?

    You seem to assume some difference between a DFA and NFA implementation, but
    this classification (Deterministic or Non-deterministic Finite-state
    Automatas) belongs more to some classes of languages that are defined with a
    single-variable state for determining its behaviour (i.e. determining the
    transition within a set of multiple candidate transitions or redictions).

    Although it is possible to parse some regexps using such single-variable
    state, the class of languages that you can parse with such automata is much
    more restricted than what Regexp's can recognize: these languages are LR or
    LR(n) languages (or its minor variant LALR that requires just one symbol of
    look-ahead to determine the transition to apply in a deterministic way
    without using backtracking). But for Regexps in the general case, you can't
    make such simplification unless you fix a maximum bound to the value of 'n'
    in the LR(n) class.

    And the highest the value of 'n', the highest the number of states you need
    in the transition graph. For Regexps, this rapidly explodes in a
    combinatorial way, so that does not work.

    To lift the restriction in the value of 'n', you have to abandon the
    restriction of single state variables. Instead, it becomes rapidly necessary
    to allow multiple states to be active *at the same time* within the
    transition graph, and text and perform the tests of *all* candidate
    transitions from the same state and at the same time. This way you get the
    expressiveness of LR(infinite) languages, which is exactly what you need to
    support general regexps.

    And if you do that, your automata is fully deterministic, uses a single (but
    compound) state represented by the individual state of activity of the
    transition graph, whose size *does not* grow in a combinatorial way (in fact
    with such implementation, if the source string representing the regexp has a
    length of 'n' character, the transition graph can be built with *at most*
    (k1 n) nodes (k1 can easily be lower than 3) and *at most* (2n)
    transitions. And this graph can be built in linear time.

    The only cost is when running the engine for matches: the time for each
    transition becomes *at worst* linearily bound to the total length (n) of the
    regexp. But in most applications, regexps are short, even if their
    expressiveness is very large (and would take a about n^(n-1) nodes in the
    transition graph of a DFA implementation).

    Choosing DFA is then certainly the worst approach: it can't go beyond the
    simplest regexps (such as matching substrings with very few alternatives,
    and NO unbounded aggregates that include unlimited loops in the NFA
    transition graph).

    As soon as you must allow more than just 'x?' and 'x{m,n}' aggregates (with
    low values of 'n' here), for example by allowing 'x+' or 'x*' or 'x{m,}',
    you must abandon the DFA approach, or you need to use compound state
    variables. And if you do that, you don't need any backtracking (backtracking
    will be used only to handle syntax errors in languages where it is useful,
    or to evaluate conditional transitions that can't be determined only by the
    current node position in the transition graph and by the current input).

    Thanks, Regexp recognizers can be built using a DFA approach *without any*
    needed backtracking, provided that you choose the correct representation of
    the state variables in your automata so that they can work *directly* based
    on the NFA-style transition graph: each node in the NFA-style transition
    graph will not be the single state variable for your DFA automata, because
    multiple states may be "active" candidates. But the set of active nodes
    perfectly defines the state of the DFA automata.

    If your automata will only accept a limited number of state variables (say
    'v'), then the regexps it will be able to recognize are in the class of
    'LR(v)' languages, and will not parse all possible regexps containing
    aggregates whose upper bound is larger than 'v', and then only you'll need
    the support of backtracking (which just complicates things and that will
    seriously slow down your automata implementation).

    So forget the idea of supporting backtracking when implementing regexps:
    they are used only for handling syntax errors in context-free grammars (like
    those recognized by yacc), but regexps are NOT context-free grammars.

    Note: the paradox here is that syntax of a regexp string can be defined and
    parsed using a context free grammar like yacc. But this does not mean that
    the language that the regexp describes and that you'll need to match with it
    is context-free. In fact it is not!

    And this becomes extremely visible if you also want to support capturing
    groups in regexps, notably for supporting the use of regexps for automated
    replacement of a match by a string built from parts of a match, in regexps
    like "(x*)(x*)" with two capturing groups: you need to use something better
    than backtracking for handling multiple possibilities, and this includes the
    possibility to use compound states, or to augment the regexp with some
    "sugar" to specify which capturing group has the priority (you can't *only*
    use the simple rule of the longest match, as this will be applicable to the
    only one of the two capturing groups in "(x*)(x*)").

    This archive was generated by hypermail 2.1.5 : Fri Oct 05 2007 - 22:13:54 CDT