Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (sam@samason.me.uk)
Date: Wed Apr 29 2009 - 06:38:58 CDT

    On Wed, Apr 29, 2009 at 01:39:47AM +0200, Bjoern Hoehrmann wrote:
    > * Sam Mason wrote:
    > >Have you seen work such as u8u16[1]? It optimises code specifically
    > >for newer instruction sets (MMX, SSE and AltiVec) resulting in higher
    > >performance (about five times faster than your code, and ~20 times
    > >faster than iconv, if I'm reading things correctly).
    >
    > I have not looked at the implementation, but the report claims on a
    > Pentium 4 with some version of iconv the following throughput rates
    >
    > ASCII-only: u8u16/SSE 19.0 : 1.0 iconv
    > German Wikipedia XML dump: u8u16/SSE 4.4 : 1.0 iconv
    > Arabic Wikipedia XML dump: u8u16/SSE 2.3 : 1.0 iconv
    > Japanese Wikipedia XML dump: u8u16/SSE 2.0 : 1.0 iconv
    >
    > And on an Intel Core Duo:
    >
    > ASCII-only: u8u16/SSE 25.8 : 1.0 iconv
    > German Wikipedia XML dump: u8u16/SSE 6.6 : 1.0 iconv
    > Arabic Wikipedia XML dump: u8u16/SSE 3.6 : 1.0 iconv
    > Japanese Wikipedia XML dump: u8u16/SSE 2.8 : 1.0 iconv
    >
    > My own tests suggest it's not difficult to achieve a throughput
    > of 2-5 times that of the two iconv versions I've tried; my code
    > would be at 5.3 times the rate of iconv for the Hindi Wikipedia
    > on my system, for example.

    Huh, what was innovative about their work then?

    Table 3 of the paper I linked says that 83% of their Japanese dataset
    consisted of 3-byte UTF-8 encoded characters. I just hacked some code
    together to look at the distribution of characters (before remembering
    that I think you quoted this, doh!) and it came out as:

      ASCII  67.3% (864730778 chars)
      2byte   1.2% ( 14949330 chars)
      3byte  31.5% (404729043 chars)
      4byte  <0.1% (    23228 chars)
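
    For reference, a tally like the one above can be sketched in a few
    lines of Python. This is only an illustration, not the code used for
    the numbers above: the function names are made up, and it assumes the
    input is well-formed UTF-8, so each character is classified purely by
    its lead byte.

```python
from collections import Counter


def utf8_length_histogram(data: bytes) -> Counter:
    """Count characters by encoded length (1-4 bytes) from their lead bytes.

    Assumption: `data` is well-formed UTF-8. No validation is done;
    continuation bytes (10xxxxxx) are simply skipped.
    """
    counts = Counter()
    for b in data:
        if b < 0x80:        # 0xxxxxxx: ASCII, one byte
            counts[1] += 1
        elif b < 0xC0:      # 10xxxxxx: continuation byte, not a new character
            continue
        elif b < 0xE0:      # 110xxxxx: lead byte of a 2-byte sequence
            counts[2] += 1
        elif b < 0xF0:      # 1110xxxx: lead byte of a 3-byte sequence
            counts[3] += 1
        else:               # 11110xxx: lead byte of a 4-byte sequence
            counts[4] += 1
    return counts


def report(counts: Counter) -> list:
    """Format the histogram as one line per encoded length."""
    total = sum(counts.values())
    labels = {1: "ASCII", 2: "2byte", 3: "3byte", 4: "4byte"}
    lines = []
    for n in (1, 2, 3, 4):
        share = 100.0 * counts[n] / total if total else 0.0
        lines.append(f"{labels[n]} {share:5.1f}% ({counts[n]} chars)")
    return lines
```

    Feeding it the bytes of a dump (read in binary mode) and printing
    the lines from report() gives a table in the same shape as the one
    above. For a multi-gigabyte file you would want to read in chunks
    rather than all at once; since the count looks at single bytes only,
    chunk boundaries falling inside a character do no harm.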

    That's a somewhat different distribution, but given the small
    difference in speedup between their Arabic and Japanese datasets I'm
    not sure it would make any appreciable difference.

    So either iconv got slower between when their paper was written and
    when you tested your code (and hence the speedup attained by your code
    is greater), or their code isn't very amazing!

    -- 
      Sam  http://samason.me.uk/
    

