Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Apr 27 2009 - 08:52:41 CDT

    * Asmus Freytag wrote:
    >If I understand him correctly, Bjoern also suggests his method to give
    >yet another avenue for Unicode-enabling of existing multi-byte aware
    >applications. Depending on the circumstances in each case, such retrofit
    >might make sense.

    Yes. The application can transform the grammar itself as a
    pre-processing step and then use it without any other changes, or you
    can pre-process the grammar externally before handing it to the
    application, in which case the application needs no changes at all.
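
    To illustrate what such a pre-processing step might look like, here is
    a minimal Python sketch that rewrites one range of Unicode scalar values
    as an alternation of UTF-8 byte sequences, which a byte-oriented regex
    engine can then use as-is. The names utf8_sequences and byte_regex are
    made up for this example, and the range-splitting approach is a commonly
    used one, not necessarily the method discussed in this thread.

```python
import re

def utf8_sequences(lo, hi):
    """Yield lists of (min_byte, max_byte) pairs that together match
    exactly the UTF-8 encodings of the scalar values lo..hi."""
    # Surrogates are not scalar values and have no UTF-8 encoding.
    if lo <= 0xDFFF and hi >= 0xD800:
        if lo < 0xD800:
            yield from utf8_sequences(lo, 0xD7FF)
        if hi > 0xDFFF:
            yield from utf8_sequences(0xE000, hi)
        return
    # Split where the encoded length changes (1, 2 and 3 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            yield from utf8_sequences(lo, limit)
            yield from utf8_sequences(limit + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # Split until every trailing byte that varies spans a full 0x80..0xBF
    # block, so the range becomes a plain sequence of byte ranges.
    for i in (1, 2, 3):
        m = (1 << (6 * i)) - 1
        if (lo & ~m) != (hi & ~m):
            if lo & m != 0:
                yield from utf8_sequences(lo, lo | m)
                yield from utf8_sequences((lo | m) + 1, hi)
                return
            if hi & m != m:
                yield from utf8_sequences(lo, (hi & ~m) - 1)
                yield from utf8_sequences(hi & ~m, hi)
                return
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    yield list(zip(first, last))

def byte_regex(lo, hi):
    """Return a bytes pattern matching the UTF-8 form of lo..hi."""
    def unit(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    alternatives = ("".join(unit(a, b) for a, b in seq)
                    for seq in utf8_sequences(lo, hi))
    return "|".join(alternatives).encode("ascii")

# Example: U+00E0..U+00FF as a byte-level pattern.
pattern = re.compile(byte_regex(0xE0, 0xFF))
print(byte_regex(0xE0, 0xFF), bool(pattern.fullmatch("é".encode("utf-8"))))
```

    For U+00E0..U+00FF this yields the byte pattern \xC3[\xA0-\xBF]; a full
    grammar transformation would apply the same rewriting to every
    character class in the grammar.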

    Modifying an application so it decodes UTF-8 streams and then operates
    on the scalar values is considerably more complicated; you would likely
    use two code paths for byte-level and Unicode processing, and you need
    new data structures for Unicode character classes, for instance.

    As far as performance is concerned, there appears to be little published
    comparative research into this problem. I hope my implementation may
    help to change that.

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    

