Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (sam@samason.me.uk)
Date: Wed Apr 29 2009 - 06:38:58 CDT

Next message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

On Wed, Apr 29, 2009 at 01:39:47AM +0200, Bjoern Hoehrmann wrote:
> * Sam Mason wrote:
> >Have you seen work such as u8u16[1]? It optimises code specifically
> >for newer instruction sets (MMX, SSE and AltiVec) resulting in higher
> >performance (about five times faster than your code, and ~20 times
> >faster than iconv, if I'm reading things correctly).
>
> I have not looked at the implementation, but the report claims on a
> Pentium 4 with some version of iconv the following throughput rates
>
> ASCII-only: u8u16/SSE 19.0 : 1.0 iconv
> German Wikipedia XML dump: u8u16/SSE 4.4 : 1.0 iconv
> Arabic Wikipedia XML dump: u8u16/SSE 2.3 : 1.0 iconv
> Japanese Wikipedia XML dump: u8u16/SSE 2.0 : 1.0 iconv
>
> And on a Intel Core Duo:
>
> ASCII-only: u8u16/SSE 25.8 : 1.0 iconv
> German Wikipedia XML dump: u8u16/SSE 6.6 : 1.0 iconv
> Arabic Wikipedia XML dump: u8u16/SSE 3.6 : 1.0 iconv
> Japanese Wikipedia XML dump: u8u16/SSE 2.8 : 1.0 iconv
>
> My own tests suggest it's not difficult to achieve a throughput
> of 2-5 times that of the two iconv versions I've tried; my code
> would be at 5.3 times the rate of iconv for the Hindi Wikipedia
> on my system, for example.

Huh, what was innovative about their work then?

Table 3 of the paper I linked says that 83% of their Japanese dataset
consisted of 3byte UTF-8 encoded characters. I just hacked some code
together to look at the distribution of characters (before remembering
that I think you quoted this, doh!) and it came out as:

  ASCII 67.3% (864730778 chars)
  2byte 1.2% ( 14949330 chars)
  3byte 31.5% (404729043 chars)
  4byte <0.1% ( 23228 chars)
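
For illustration, a tally like this can be produced by classifying each
byte by its high bits. What follows is a minimal Python sketch, not the
code that produced the numbers above; the name utf8_length_histogram is
made up here, and it assumes the input is well-formed UTF-8:

```python
from collections import Counter


def utf8_length_histogram(data: bytes) -> Counter:
    """Count characters by encoded length (1 to 4 bytes).

    Only lead bytes are counted; continuation bytes (10xxxxxx) are
    skipped. Assumes well-formed UTF-8: no validation is done, and
    bytes that can never start a sequence are tallied under key 0.
    """
    counts = Counter()
    for b in data:
        if b < 0x80:
            counts[1] += 1      # 0xxxxxxx: ASCII
        elif b < 0xC0:
            continue            # 10xxxxxx: continuation byte
        elif b < 0xE0:
            counts[2] += 1      # 110xxxxx: lead byte of a 2-byte sequence
        elif b < 0xF0:
            counts[3] += 1      # 1110xxxx: lead byte of a 3-byte sequence
        elif b < 0xF8:
            counts[4] += 1      # 11110xxx: lead byte of a 4-byte sequence
        else:
            counts[0] += 1      # 0xF8..0xFF never appear in UTF-8
    return counts
```

Running it over a file read in binary mode and dividing each count by
the total gives percentages of the kind listed above.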

Somewhat different distribution, but still, given the small difference
between their Arabic and Japanese datasets, I'm not sure this would
make any appreciable difference.

So either iconv got slower between when their paper was written and
when you tested your code (and hence the speedup attained by your code
is greater), or their code isn't very amazing!

-- 
  Sam  http://samason.me.uk/

