UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Sun Apr 26 2009 - 00:01:56 CDT

  • Next message: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    Hello,

      A while ago I published the Perl module Unicode::SetAutomata on CPAN.
    Consider you have a UTF-8 encoded string and wish to know whether it is
    a letter followed by a digit. You could use a finite automaton to check
    for that, it would go like this:

      +---------+ +---------+ +---------+
      | | Letter | | Digit | |
      | Start | ----------> | | ---------> | Final |
      | | | | | |
      +---------+ +---------+ +---------+

    If you just worry about bytes, this would be easily implemented. You
    could just make a simple array that encodes for each byte whether it
    is a letter, a digit, or something else. With the number of allocated
    code points in Unicode, this would require too much space, alternate
    encodings would be needed. Consider a corresponding regular expression:

      /[:UnicodeLetter:][:UnicodeDigit:]/

    Which would expand so something like

      /(A|B|...|a|b|...|À|...)(...)/

    Now, if we replace each character by its UTF-8 encoding, we would ob-
    tain a regular expression and corresponding automata that match the
    same language, but would operate directly on bytes:

      /(A|B|...|a|b|...|\xC3\x80|...)(...)/

    This is the first problem that the module solves: you can pass it a
    character class and it would give you a short regular expression that
    matches UTF-8 encoded strings that encode one of the characters in the
    class. The second problem is a related one. It gives you an automaton
    like this:

                                +---------+
                                | |
                   +--- ... --> | Letter |
                  / | |
                 / +---------+
      +---------+ +---------+
      | | | |
      | Start | ---- ... ---> | Digit |
      | | | |
      +---------+ +---------+
                 \ ###########
                  \ # #
                   +--- ... --> # Other #
                                # #
                                ###########

    That is, you can pass it a couple of character classes, and it gives
    you a deterministic finite automaton with a distinct accepting states
    for each of the (disjoint) character classes. The first automaton
    could then be implemented in a two stage process: pass bytes to the
    character class determining automaton until a character has been read,
    then advance the class-based automaton using that character class.

    There are a number of existing text processing tools that currently
    lack usable Unicode support; the first feature offers a simple way to
    use them with UTF-8 encoded data. The second feature was implemented
    mainly because it solves the first problem for many character classes
    at once, and to study whether such an approach as feasible.

    Currently I do not have the time to actually study that, but a rule
    of thumb seems to be that the 4-tuples returned by the module take
    up about as much space as inversion lists would, assuming that most
    automata would have less than 256 states so each 4-tuple can be en-
    coded as 32 bit integer.

    Generously commented code and some documentation are available at:

      http://search.cpan.org/dist/Unicode-SetAutomaton/

    regards,

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    


    This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 00:07:09 CDT