Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Tue Apr 28 2009 - 18:39:47 CDT

    * Sam Mason wrote:
    >Have you seen work such as u8u16[1]? It optimises code specifically
    >for newer instruction sets (MMX, SSE and AltiVec) resulting in higher
    >performance (about five times faster than your code, and ~20 times
    >faster than iconv, if I'm reading things correctly).

    I have not looked at the implementation, but the report claims the
    following throughput rates relative to some version of iconv on a
    Pentium 4:

                       ASCII-only: u8u16/SSE 19.0 : 1.0 iconv
        German Wikipedia XML dump: u8u16/SSE 4.4 : 1.0 iconv
        Arabic Wikipedia XML dump: u8u16/SSE 2.3 : 1.0 iconv
      Japanese Wikipedia XML dump: u8u16/SSE 2.0 : 1.0 iconv

    And on an Intel Core Duo:

                       ASCII-only: u8u16/SSE 25.8 : 1.0 iconv
        German Wikipedia XML dump: u8u16/SSE 6.6 : 1.0 iconv
        Arabic Wikipedia XML dump: u8u16/SSE 3.6 : 1.0 iconv
      Japanese Wikipedia XML dump: u8u16/SSE 2.8 : 1.0 iconv

    My own tests suggest it's not difficult to achieve a throughput
    of 2-5 times that of the two iconv versions I've tried; my code
    would be at 5.3 times the rate of iconv for the Hindi Wikipedia
    on my system, for example.
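
    The figures above are throughput quotients: input bytes converted
    per second by one converter, divided by the same figure for a
    baseline. A minimal sketch of such a measurement in Python follows.
    The converter names convert_builtin and convert_chunked are
    hypothetical stand-ins invented for this illustration; this is not
    the harness behind the u8u16 report or the tests described above,
    and it does not call iconv.

```python
import codecs
import time


def convert_builtin(data: bytes) -> bytes:
    """UTF-8 to UTF-16LE in one call (hypothetical stand-in converter)."""
    return data.decode("utf-8").encode("utf-16-le")


def convert_chunked(data: bytes, chunk: int = 16) -> bytes:
    """Same conversion, fed in small chunks (hypothetical slow baseline)."""
    # The incremental decoder buffers multi-byte sequences that are
    # split across chunk boundaries, so the result stays correct.
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for i in range(0, len(data), chunk):
        parts.append(decoder.decode(data[i:i + chunk]))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts).encode("utf-16-le")


def throughput(convert, data: bytes, repeats: int = 5) -> float:
    """Best observed rate in input bytes per second over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        convert(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on a coarse clock.
    return len(data) / max(best, 1e-9)


def relative_rate(candidate, baseline, data: bytes) -> float:
    """Throughput of candidate expressed as a multiple of baseline."""
    return throughput(candidate, data) / throughput(baseline, data)


if __name__ == "__main__":
    # Assumed sample input; a real test would read a Wikipedia dump.
    sample = ("Grüße, мир, こんにちは " * 20000).encode("utf-8")
    ratio = relative_rate(convert_builtin, convert_chunked, sample)
    print(f"builtin {ratio:.1f} : 1.0 chunked")
```

    Taking the best of several runs reduces noise from other activity
    on the machine; the ratio still depends heavily on the input, as
    the ASCII-only and Japanese rows above show.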

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    

