Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Wed Apr 29 2009 - 07:06:00 CDT


    * Hans Aberg wrote:
    >On 26 Apr 2009, at 07:01, Bjoern Hoehrmann wrote:
    >> This is the first problem that the module solves: you can pass it a
    >> character class and it would give you a short regular expression that
    >> matches UTF-8 encoded strings that encode one of the characters in the
    >> class.
    >
    >Is this something similar to
    > http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html

    Yes. The differences are that I consider not just individual ranges
    but sets of ranges, and build DFAs from them that are as small as
    possible, then convert the automaton to a regular expression if needed.
    With my code you can also pass in multiple disjoint sets and get an
    automaton that has a different final state for each set. That makes it
    relatively quick to determine which set each character in a string
    belongs to.
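
    To illustrate the multi-set idea, here is a minimal Python sketch. It
    is not the module discussed in this thread, only a naive approximation
    under stated assumptions: the sets are small enough to enumerate every
    code point, the sets are disjoint, and the automaton is acyclic, so
    equivalent states can be merged bottom-up. The names build_trie,
    minimize, build_dfa and classify are made up for this example.

```python
def build_trie(sets):
    """sets maps a label to a list of inclusive (lo, hi) code point ranges."""
    root = {"next": {}, "label": None}
    for label, ranges in sets.items():
        for lo, hi in ranges:
            for cp in range(lo, hi + 1):
                if 0xD800 <= cp <= 0xDFFF:
                    continue  # surrogates have no UTF-8 encoding
                node = root
                # One transition per UTF-8 byte of the character.
                for byte in chr(cp).encode("utf-8"):
                    node = node["next"].setdefault(
                        byte, {"next": {}, "label": None})
                node["label"] = label  # final state remembers its set
    return root


def minimize(node, registry):
    """Merge equivalent states bottom-up and return the canonical state id.

    Two states are equivalent when they carry the same label and the same
    byte transitions into already-canonical child states.
    """
    edges = tuple(sorted(
        (byte, minimize(child, registry))
        for byte, child in node["next"].items()))
    signature = (node["label"], edges)
    return registry.setdefault(signature, len(registry))


def build_dfa(sets):
    """Return (start_state, states); states maps id -> (label, {byte: id})."""
    registry = {}
    start = minimize(build_trie(sets), registry)
    states = {sid: (label, dict(edges))
              for (label, edges), sid in registry.items()}
    return start, states


def classify(text, dfa):
    """Return, for each character of text, the label of its set or None."""
    start, states = dfa
    labels = []
    for ch in text:
        state = start
        for byte in ch.encode("utf-8"):
            if state is not None:
                state = states[state][1].get(byte)
        labels.append(states[state][0] if state is not None else None)
    return labels


example_sets = {
    "digit": [(0x30, 0x39)],        # 0-9
    "greek": [(0x3B1, 0x3C9)],      # Greek small letters alpha to omega
    "hiragana": [(0x3041, 0x3096)],
}
example_dfa = build_dfa(example_sets)
print(classify("7αあx", example_dfa))
```

    For the three example sets this prints
    ['digit', 'greek', 'hiragana', None]. Enumerating code points is only
    workable for small sets; large sets would need the ranges to be split
    along UTF-8 length and lead-byte boundaries instead.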

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    

