Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Hans Aberg (
Date: Wed Apr 29 2009 - 07:16:53 CDT

  • Next message: Andreas Prilop: "Re: Bidi demo"

    On 29 Apr 2009, at 14:06, Bjoern Hoehrmann wrote:

    >> Is this something similar to
    > Yes. The differences are that I do not consider just individual ranges
    > but sets of ranges and build minimal (as possible) DFAs from them,
    > then
    > convert the automaton to regular expressions if needed. With my code
    > you
    > can also pass in multiple disjoint sets and get an automaton that
    > has a
    > different final state for each set. That allows to relatively quickly
    > determine which set each character in a string belongs to.

    The objectives are somewhat different. My idea was not having to
    rewrite Flex entirely, which has 8-byte indexed lookup tables (so 22
    or 32 bit indexed tables would be impractical). The program Flex does
    the conversion from NFAs to DFAs (Regex to NFA is fairly
    straightforward) using various optimizations, and it suffices to have
    simple ranges, as other character classes can built from that.


    This archive was generated by hypermail 2.1.5 : Wed Apr 29 2009 - 07:19:51 CDT