RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 10:38:23 CDT

  • Next message: Raymond Mercier: "Re: New FAQ page"

    Hans Aberg wrote:
    > Envoyé : samedi 13 octobre 2007 16:07
    > À : Mark Davis
    > Cc : Unicode
    > Objet : Re: FYI: Regex paper for UTC
    >
    > On 10 Oct 2007, at 19:38, Mark Davis wrote:
    >
    > > Andy and I put together a paper of recommendations for the UTC at
    > > http://docs.google.com/Doc?id=dfqr8rd5_32kv97tx .
    >
    > I suspect you will run into problems when mixing complement with the
    > Kleene closure (zero or more concatenations) operator (and other
    > operators generating an infinite number of strings in the associated
    > language).
    >
    > So one safe way is to admit the usual set operations on character
    > classes only, and letting regular expressions having only the usual
    > operators acting on character classes.

    This is only true if you use the assumption that the current state of the
    generated DFA should be represented using a single integer type. But if you
    lift this limitation, there's nothing impossible (and in fact you reduce a
    lot the amount of resources needed to build the DFA, and to run it locally
    in NUMA environments with many parallel nodes collaborating on smaller sets
    of data.)

    Change your design to support multi-variable states, and your DFA will
    support parallel architectures more easily, and will scale better to much
    faster performance. In a massively parallel environment (or distributed
    environments), you even don't need any DFA, because the NFA approach is MUCH
    faster and much more scalable.

    Compiling a regexp to a flat DFA was good for single-core single-processor
    systems, but even in this case, it could not solve very complex problems due
    to lack of resource or processing power (or lack of time for compiling the
    DFA dynamically at runtime). Using a basic NFA has the advantage of not
    requiring complex compilation, just a more complex matcher engine whose
    complexity is bounded and grows linearily (instead of exponentially with a
    DFA).

    The NFA approach clearly supports infinite structures in a very clean way
    (without requiring any backtracking using a stack of infinite maximum size,
    as you may assume if you limit your state to a single integer variable).



    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 10:41:06 CDT