RE: FYI: Regex paper for UTC

From: Philippe Verdy (
Date: Sat Oct 13 2007 - 17:05:57 CDT

  • Next message: Hans Aberg: "Re: FYI: Regex paper for UTC"

    Hans Aberg [] wrote:
    > Mostly, one would use the set difference \ operator, rather than the
    > complement ~.

    Of course yes, but in fact I have implemented my regexp by converting the
    "&" and "\" operators using the negation ~ and alternations with |.

    All my regexps are first normalized to alternations and negations ONLY
    (instead of putting the negation at the beginning of the regexp ONLY and
    using multiple binary set operators "|", "&", "\" like suggested in the
    recent paper submitted to the UTC).

    This greatly simplifies the optimization of the transition graph, where I
    reduce and reorder the number of transitions per node in the graph to reduce
    the number of tests to perform at each state.

    In my implementation, ALL characters of the alphabet have a outgoing
    transition to another node: either a positive transition to the next node,
    or a transition to the initial node of the subgraph, depending on the
    transition test; depending on the number of input symbols that each
    transition can match, I will reverse the test so that the positive condition
    will not branch in the graph, going simply to the next numbered node, and
    the negative branch will skip over the sequence; in addition, I have
    reordered the transitions so that the first transition in the graph is the
    one that matches the largest number of possible input symbols in the

    The generated graph becomes a linear program similar to assembly language
    containing coded instructions like:

    0: ADVANCE
       COMPARETEST alternateclass1
       JUMPIFTRUE found
       COMPARETEST alternateclass2
       JUMPIFFALSE index2

    And such program is just stored in a simple indexed vector of fixed size,
    the maximum size being predicable and finite in k.O(n) where n is the length
    of the input regexp, and k is below 4 (the worst case where k=4 could be
    reached is almost never seen even in the most complex regexps). Such
    internal binary representation (used for regexps dynamically feeeded to the
    regexp compiler at runtime just before performing the actual matches) is
    easily convertible to a C or assembly program (with a small difficulty for
    Java, solved for now by using a switch in a loop for jump labels) which
    won't need any interpret loop (however I have not worked that part
    extensively, I have just demonstrated it was possible to do it). This
    representation offers excellent performances at run-time due to data
    locality and a heuristic branch prediction based on computed probabilities
    for each alternative.

    This representation also allows implementing even the most complex character
    classes like \p{gc=L|No|Pi|Pf} or custom, userdefined conditions depending
    on the value of the already matched substring (for now however I just use
    predefined conditions because it's easy to perform branch predictions).

    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 17:07:57 CDT