UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Sun Apr 26 2009 - 00:01:56 CDT

Next message: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Andreas Stötzner: "Encoding of symbols"
Next in thread: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Hans Aberg: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Hello,

A while ago I published the Perl module Unicode::SetAutomaton on CPAN.
Suppose you have a UTF-8 encoded string and wish to know whether it is
a letter followed by a digit. You could use a finite automaton to check
for that; it would look like this:

  +---------+            +---------+           +---------+
  |         |   Letter   |         |   Digit   |         |
  |  Start  | ---------> |         | --------> |  Final  |
  |         |            |         |           |         |
  +---------+            +---------+           +---------+
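
As a rough illustration, the class-level automaton above can be
sketched in Python by decoding the input first and classifying each
character. This is not taken from the module; the state name "Middle",
the helper names, and the use of the general categories L* and Nd as
stand-ins for "Letter" and "Digit" are assumptions made for the example.

```python
import unicodedata

def char_class(ch):
    """Map a character to a coarse class (assumed: L* = Letter, Nd = Digit)."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "Letter"
    if category == "Nd":
        return "Digit"
    return "Other"

# Transitions of the three-state automaton; "Middle" is a made-up name
# for the unlabelled state in the diagram.
TRANSITIONS = {
    ("Start", "Letter"): "Middle",
    ("Middle", "Digit"): "Final",
}

def is_letter_then_digit(data):
    """Return True if the UTF-8 bytes encode exactly a letter then a digit."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    state = "Start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:          # no transition: reject
            return False
    return state == "Final"
```

Here the decoding step hides the byte-level work; the rest of this
message is about doing that part with an automaton as well.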

If you only had to worry about bytes, this would be easy to implement:
a simple array could record for each byte whether it is a letter, a
digit, or something else. With the number of allocated code points in
Unicode, such a table would require too much space, so alternative
encodings are needed. Consider a corresponding regular expression:

/[:UnicodeLetter:][:UnicodeDigit:]/

Which would expand to something like

/(A|B|...|a|b|...|À|...)(...)/

Now, if we replace each character by its UTF-8 encoding, we obtain a
regular expression and a corresponding automaton that match the same
language, but operate directly on bytes:

/(A|B|...|a|b|...|\xC3\x80|...)(...)/
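
The substitution can be tried out with a naive sketch in Python that
simply joins the UTF-8 encodings of a few sample characters into one
alternation. The function name and the tiny sample sets are made up for
the example, and the result is deliberately not a short expression: it
only shows that the byte-level pattern accepts the same strings.

```python
import re

def utf8_alternation(chars):
    """Build a bytes regex matching the UTF-8 encoding of any char in chars."""
    alternatives = [re.escape(ch.encode("utf-8")) for ch in chars]
    return b"(?:" + b"|".join(alternatives) + b")"

# Tiny sample sets standing in for the full Letter and Digit classes.
LETTERS = "ABabÀ"          # "À" is U+00C0, encoded as the bytes C3 80
DIGITS = "0123456789"

# Letter followed by digit, matched directly against bytes.
PATTERN = re.compile(utf8_alternation(LETTERS) + utf8_alternation(DIGITS))

def matches(data):
    """Return True if data is one sample letter followed by one sample digit."""
    return PATTERN.fullmatch(data) is not None
```

With the full letter class such an alternation would be enormous, which
is why a compact equivalent is worth generating.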

This is the first problem that the module solves: you can pass it a
character class and it gives you a short regular expression that
matches UTF-8 encoded strings that encode one of the characters in the
class. The second problem is a related one. It gives you an automaton
like this:

                            +---------+
                            |         |
               +--- ... --> | Letter  |
              /             |         |
             /              +---------+
  +---------+               +---------+
  |         |               |         |
  |  Start  | ---- ... ---> |  Digit  |
  |         |               |         |
  +---------+               +---------+
             \              ###########
              \             #         #
               +--- ... --> #  Other  #
                            #         #
                            ###########

That is, you can pass it a couple of character classes, and it gives
you a deterministic finite automaton with a distinct accepting state
for each of the (disjoint) character classes. The first automaton
could then be implemented in a two-stage process: pass bytes to the
character class determining automaton until a character has been read,
then advance the class-based automaton using that character class.
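
The two-stage process might look roughly like the following Python
sketch. It does not use the module: the byte-level classifier is built
as a simple trie over the UTF-8 encodings of a few sample characters,
all names are hypothetical, and the "Other" class is left out, so any
byte sequence outside the listed classes is rejected (which has the
same effect for the letter-digit automaton).

```python
def build_classifier(classes):
    """Build a byte-level DFA from {class name: string of characters}.

    Returns (transitions, accepting): transitions maps (state, byte) to
    a state, accepting maps a state to its class name. State 0 is the
    start state. The classes are assumed to be disjoint.
    """
    transitions = {}
    accepting = {}
    final_state = {}
    next_state = 1
    for name in classes:                 # one accepting state per class
        final_state[name] = next_state
        accepting[next_state] = name
        next_state += 1
    for name, chars in classes.items():
        for ch in chars:
            encoded = ch.encode("utf-8")
            state = 0
            for byte in encoded[:-1]:    # shared prefixes share states
                key = (state, byte)
                if key not in transitions:
                    transitions[key] = next_state
                    next_state += 1
                state = transitions[key]
            transitions[(state, encoded[-1])] = final_state[name]
    return transitions, accepting

# Class-level automaton; "Middle" is a made-up name for the inner state.
CLASS_TRANSITIONS = {
    ("Start", "Letter"): "Middle",
    ("Middle", "Digit"): "Final",
}

BYTE_TRANSITIONS, ACCEPTING = build_classifier({"Letter": "aÀ", "Digit": "7"})

def matches(data):
    """Run both stages over the bytes in data."""
    byte_state = 0
    class_state = "Start"
    for byte in data:
        byte_state = BYTE_TRANSITIONS.get((byte_state, byte))
        if byte_state is None:           # byte not allowed here
            return False
        if byte_state in ACCEPTING:      # a whole character has been read
            key = (class_state, ACCEPTING[byte_state])
            class_state = CLASS_TRANSITIONS.get(key)
            if class_state is None:
                return False
            byte_state = 0               # restart the classifier
    return byte_state == 0 and class_state == "Final"
```

A trie is not minimal, but it is deterministic because no UTF-8
encoding is a prefix of another.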

There are a number of existing text processing tools that currently
lack usable Unicode support; the first feature offers a simple way to
use them with UTF-8 encoded data. The second feature was implemented
mainly because it solves the first problem for many character classes
at once, and to study whether such an approach is feasible.

Currently I do not have the time to actually study that, but a rule
of thumb seems to be that the 4-tuples returned by the module take
up about as much space as inversion lists would, assuming that most
automata would have fewer than 256 states so each 4-tuple can be
encoded as a 32-bit integer.
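
To make the 32-bit estimate concrete, here is a small Python sketch
that packs one transition into a single integer. The field layout
(source state, first byte, last byte, target state, eight bits each) is
an assumption made for the example and should be checked against the
module's documentation; the function names and the sample table are
likewise made up.

```python
def pack_tuple(src, first, last, dst):
    """Pack an assumed (src state, first byte, last byte, dst state) tuple."""
    for value in (src, first, last, dst):
        if not 0 <= value <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
    return (src << 24) | (first << 16) | (last << 8) | dst

def unpack_tuple(word):
    """Inverse of pack_tuple."""
    return (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF

def step(packed, state, byte):
    """Follow one transition by linear search; None means no transition."""
    for word in packed:
        src, first, last, dst = unpack_tuple(word)
        if src == state and first <= byte <= last:
            return dst
    return None

# Sample table: state 1 is reached after A-Z, a-z, or U+00C0..U+00D6
# (lead byte C3 followed by a byte in 80..96).
PACKED = [
    pack_tuple(0, 0x41, 0x5A, 1),
    pack_tuple(0, 0x61, 0x7A, 1),
    pack_tuple(0, 0xC3, 0xC3, 2),
    pack_tuple(2, 0x80, 0x96, 1),
]
```

A real table would be sorted so that transitions can be found by binary
search instead of the linear scan used here.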

Generously commented code and some documentation are available at:

http://search.cpan.org/dist/Unicode-SetAutomaton/

regards,

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 00:07:09 CDT