RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (
Date: Tue Oct 09 2007 - 18:11:22 CDT

  • Next message: Rick McGowan: "Public Review Issue Update - UTS #10: Unicode Collation Algorithm"

    Hans Aberg [] wrote:
    > REs can be rewritten into DFAs, and vice versa. And taking the
    > language complement of a DFA is easier than that of a RE.

    May be, but the production of the DFA is prohibitive. I've implemented
    negated regexps without flattening the transition graph into asingle-sate

    In fact, my implementation can be seen as a DFA too (it just has more state
    variables, but I can determine which one to use in a deterministic way and a
    bounded time that does not depend much on the regexp length, but only on the
    number of alternations in the source regexp). It does not
    explodecombinatorially like traditional DFA implementation and the needed
    resource for storing the complied transition graph only grows linearily, not
    exponsentially, with the source regexp length, meaning that it stays within
    the limits of the data cache, and avoids almost all misses (the matcher
    benefits from data locality). And I don't need any backtracking.

    Anyway, you're still not replying to the question: which is the expected
    ordering for the returned matches? I absolutely don't care if you use a DFA
    or not. This is important because it affects the way a matcher that needsto
    recognize negated regexps will return its results (because if you want the
    "complementary", the order of analysis of the transition graph changes
    completely to keep the condition about the order of results.

    (1) the fact that the transitions from a node in the graph are fully
    determined from a single variable does NOT affect the system, given that
    even in a DFA there's an order (and in fact you can't build easily (without
    exploding the graph size); the full determination of transitions that
    allowpossible matches can't be determined without first testing them ALL one
    by one, due to the dynamic character classes and to the counted aggregates,
    in order to get the full set of transition to use and advance; and this is
    NOT backtracking as the decision is based ONLY on the current atomic

    (2) negating a regexp (or subrexp) also affects its quantification (for its
    terminal aggregate), when the terminal node in the NFA contains a loop (so
    if you have a priority based on longest matches for the output, the negated
    regexp will need to reverse this priority for its smallest matches.

    This archive was generated by hypermail 2.1.5 : Tue Oct 09 2007 - 18:14:06 CDT