Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Apr 27 2009 - 08:52:41 CDT

Next message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: John (Eljay) Love-Jensen: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Hans Aberg: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

* Asmus Freytag wrote:
>If I understand him correctly, Bjoern also suggests his method to give
>yet another avenue for Unicode-enabling of existing multi-byte aware
>applications. Depending on the circumstances in each case, such retrofit
>might make sense.

Yes. You can build the grammar transformation into the application as a
pre-processing step and then use the grammar without making other
changes, or make no changes to the application at all if you transform
the grammar before handing it over.
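
To illustrate what such a pre-processing step can look like, here is a
minimal sketch in Python. It is not the implementation discussed in this
thread, only the commonly used range-splitting approach, and the names
utf8_byte_ranges and byte_regex are made up for this example. It rewrites
one range of scalar values as alternatives of byte ranges, which a
byte-oriented regexp engine can then use unchanged.

```python
import re

def utf8_byte_ranges(lo, hi):
    """Yield lists of (min_byte, max_byte) pairs that together match
    exactly the UTF-8 encodings of the scalar values lo..hi."""
    if lo > hi:
        return
    # Surrogates are not scalar values; cut them out of the range.
    if lo <= 0xDFFF and hi >= 0xD800:
        yield from utf8_byte_ranges(lo, 0xD7FF)
        yield from utf8_byte_ranges(0xE000, hi)
        return
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            yield from utf8_byte_ranges(lo, limit)
            yield from utf8_byte_ranges(limit + 1, hi)
            return
    if hi < 0x80:
        yield [(lo, hi)]
        return
    # Split until every continuation byte position that follows a
    # differing byte covers the full 0x80..0xBF range.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                yield from utf8_byte_ranges(lo, lo | mask)
                yield from utf8_byte_ranges((lo | mask) + 1, hi)
                return
            if hi & mask != mask:
                yield from utf8_byte_ranges(lo, (hi & ~mask) - 1)
                yield from utf8_byte_ranges(hi & ~mask, hi)
                return
    # Both ends now have the same length and align byte by byte.
    yield list(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))

def byte_regex(lo, hi):
    """Render the byte ranges as a bytes pattern for the re module."""
    def unit(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    alts = ["".join(unit(a, b) for a, b in seq)
            for seq in utf8_byte_ranges(lo, hi)]
    return ("(?:" + "|".join(alts) + ")").encode("ascii")

# U+0080..U+07FF becomes the single alternative [\xC2-\xDF][\x80-\xBF].
two_byte = re.compile(byte_regex(0x80, 0x7FF))
```

The compiled pattern operates on raw bytes, so the matching code never
decodes anything. Overlong forms and encoded surrogates are rejected
simply because no alternative covers them.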

Modifying an application so it decodes UTF-8 streams and then operates
on the scalar values is considerably more complicated; you would likely
use two code paths for byte-level and Unicode processing, and you need
new data structures for Unicode character classes, for instance.
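
For contrast, a rough sketch of the decode-first route, again in Python
and with hypothetical names (ScalarClass, count_members). It assumes a
character class is stored as a sorted table of scalar-value ranges, which
is the kind of extra data structure a byte-oriented application would
have to gain.

```python
import bisect

class ScalarClass:
    """A character class kept as sorted, merged scalar-value ranges."""

    def __init__(self, ranges):
        merged = []
        for lo, hi in sorted(ranges):
            if merged and lo <= merged[-1][1] + 1:
                # Overlapping or adjacent: extend the previous range.
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
            else:
                merged.append((lo, hi))
        self.starts = [lo for lo, _ in merged]
        self.ends = [hi for _, hi in merged]

    def __contains__(self, scalar):
        # Binary search for the last range starting at or before scalar.
        i = bisect.bisect_right(self.starts, scalar) - 1
        return i >= 0 and scalar <= self.ends[i]

def count_members(data, char_class):
    """Decode UTF-8 bytes first, then test each scalar value."""
    return sum(ord(ch) in char_class for ch in data.decode("utf-8"))

greek = ScalarClass([(0x0370, 0x03FF), (0x1F00, 0x1FFF)])
```

Here every input has to pass through a decoder before the class lookup,
in addition to whatever byte-level path the application already has.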

As far as performance is concerned, there appears to be little published
comparative research into this problem. I hope my implementation may
help change that.

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/


This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 08:56:43 CDT