RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy ([email protected])
Date: Tue Oct 09 2007 - 18:11:22 CDT

Next message: Rick McGowan: "Public Review Issue Update - UTS #10: Unicode Collation Algorithm"

Previous message: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
In reply to: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Hans Aberg [mailto:[email protected]] wrote:
> REs can be rewritten into DFAs, and vice versa. And taking the
> language complement of a DFA is easier than that of a RE.

May be, but the production of the DFA is prohibitive. I've implemented
negated regexps without flattening the transition graph into asingle-sate
DFA.

In fact, my implementation can be seen as a DFA too (it just has more state
variables, but I can determine which one to use in a deterministic way and a
bounded time that does not depend much on the regexp length, but only on the
number of alternations in the source regexp). It does not
explodecombinatorially like traditional DFA implementation and the needed
resource for storing the complied transition graph only grows linearily, not
exponsentially, with the source regexp length, meaning that it stays within
the limits of the data cache, and avoids almost all misses (the matcher
benefits from data locality). And I don't need any backtracking.

Anyway, you're still not replying to the question: which is the expected
ordering for the returned matches? I absolutely don't care if you use a DFA
or not. This is important because it affects the way a matcher that needsto
recognize negated regexps will return its results (because if you want the
"complementary", the order of analysis of the transition graph changes
completely to keep the condition about the order of results.

(1) the fact that the transitions from a node in the graph are fully
determined from a single variable does NOT affect the system, given that
even in a DFA there's an order (and in fact you can't build easily (without
exploding the graph size); the full determination of transitions that
allowpossible matches can't be determined without first testing them ALL one
by one, due to the dynamic character classes and to the counted aggregates,
in order to get the full set of transition to use and advance; and this is
NOT backtracking as the decision is based ONLY on the current atomic
symbol).

(2) negating a regexp (or subrexp) also affects its quantification (for its
terminal aggregate), when the terminal node in the NFA contains a loop (so
if you have a priority based on longest matches for the output, the negated
regexp will need to reverse this priority for its smallest matches.

Next message: Rick McGowan: "Public Review Issue Update - UTS #10: Unicode Collation Algorithm"
Previous message: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
In reply to: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Oct 09 2007 - 18:14:06 CDT