# RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 17:05:57 CDT

• Next message: Hans Aberg: "Re: FYI: Regex paper for UTC"

Hans Aberg [mailto:haberg@math.su.se] wrote:
> Mostly, one would use the set difference \ operator, rather than the
> complement ~.

Of course yes, but in fact I have implemented my regexp by converting the
"&" and "\" operators using the negation ~ and alternations with |.

All my regexps are first normalized to alternations and negations ONLY
(instead of putting the negation at the beginning of the regexp ONLY and
using multiple binary set operators "|", "&", "\" like suggested in the
recent paper submitted to the UTC).

This greatly simplifies the optimization of the transition graph, where I
reduce and reorder the number of transitions per node in the graph to reduce
the number of tests to perform at each state.

In my implementation, ALL characters of the alphabet have a outgoing
transition to another node: either a positive transition to the next node,
or a transition to the initial node of the subgraph, depending on the
transition test; depending on the number of input symbols that each
transition can match, I will reverse the test so that the positive condition
will not branch in the graph, going simply to the next numbered node, and
the negative branch will skip over the sequence; in addition, I have
reordered the transitions so that the first transition in the graph is the
one that matches the largest number of possible input symbols in the
alphabet.

The generated graph becomes a linear program similar to assembly language
containing coded instructions like:

0: ADVANCE
COMPARETEST alternateclass1
JUMPIFTRUE found
COMPARETEST alternateclass2
JUMPIFFALSE index2
ADVANCE
Index1:
...
Found:

And such program is just stored in a simple indexed vector of fixed size,
the maximum size being predicable and finite in k.O(n) where n is the length
of the input regexp, and k is below 4 (the worst case where k=4 could be
reached is almost never seen even in the most complex regexps). Such
internal binary representation (used for regexps dynamically feeeded to the
regexp compiler at runtime just before performing the actual matches) is
easily convertible to a C or assembly program (with a small difficulty for
Java, solved for now by using a switch in a loop for jump labels) which
won't need any interpret loop (however I have not worked that part
extensively, I have just demonstrated it was possible to do it). This
representation offers excellent performances at run-time due to data
locality and a heuristic branch prediction based on computed probabilities
for each alternative.

This representation also allows implementing even the most complex character
classes like \p{gc=L|No|Pi|Pf} or custom, userdefined conditions depending
on the value of the already matched substring (for now however I just use
predefined conditions because it's easy to perform branch predictions).

This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 17:07:57 CDT