RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 10:38:23 CDT

Next message: Raymond Mercier: "Re: New FAQ page"

Previous message: Philippe Verdy: "RE: New FAQ page"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Hans Aberg wrote:
> Envoyé : samedi 13 octobre 2007 16:07
> À : Mark Davis
> Cc : Unicode
> Objet : Re: FYI: Regex paper for UTC
>
> On 10 Oct 2007, at 19:38, Mark Davis wrote:
>
> > Andy and I put together a paper of recommendations for the UTC at
> > http://docs.google.com/Doc?id=dfqr8rd5_32kv97tx .
>
> I suspect you will run into problems when mixing complement with the
> Kleene closure (zero or more concatenations) operator (and other
> operators generating an infinite number of strings in the associated
> language).
>
> So one safe way is to admit the usual set operations on character
> classes only, and letting regular expressions having only the usual
> operators acting on character classes.

This is only true if you use the assumption that the current state of the
generated DFA should be represented using a single integer type. But if you
lift this limitation, there's nothing impossible (and in fact you reduce a
lot the amount of resources needed to build the DFA, and to run it locally
in NUMA environments with many parallel nodes collaborating on smaller sets
of data.)

Change your design to support multi-variable states, and your DFA will
support parallel architectures more easily, and will scale better to much
faster performance. In a massively parallel environment (or distributed
environments), you even don't need any DFA, because the NFA approach is MUCH
faster and much more scalable.

Compiling a regexp to a flat DFA was good for single-core single-processor
systems, but even in this case, it could not solve very complex problems due
to lack of resource or processing power (or lack of time for compiling the
DFA dynamically at runtime). Using a basic NFA has the advantage of not
requiring complex compilation, just a more complex matcher engine whose
complexity is bounded and grows linearily (instead of exponentially with a
DFA).

The NFA approach clearly supports infinite structures in a very clean way
(without requiring any backtracking using a stack of infinite maximum size,
as you may assume if you limit your state to a single integer variable).

Next message: Raymond Mercier: "Re: New FAQ page"
Previous message: Philippe Verdy: "RE: New FAQ page"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 10:41:06 CDT