Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Apr 27 2009 - 09:28:00 CDT

Next message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

* Sam Mason wrote:
>On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
>> I'd disagree about that. It is certainly simpler to always process as code
>> points, but where performance is required, you need to design your
>> algorithms with the encoding of the core string representation in mind,
>> typically UTF-8 or UTF-16. You can get huge speedups in that way.
>
>Are there any pointers to literature about that? I'd be interested
>to see how this sort of scheme would hang together; there would seem
>to be quite a trade-off between instruction cache pressure, branch
>prediction and most probably other effects I can't think of at the
>moment.

To offer one data point, Python's UTF-8 decoder does this and I've timed
it against a couple of others. I note that the decoder accepts surrogate
code points so the comparison is not entirely fair. The times are at:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance

Transcoding my name from UTF-8 to UTF-16 is substantially faster due to
this optimization. That is largely because my processor can predict the
additional branches almost perfectly (16 byte string, the pattern for
the check is simply alternating no yes no yes). For the other data sets
Python's decoder is slower than my own without any such optimization.

I've also added a similar optimization to my own version, using the
aligned 4 byte read and aligned 16 byte reads using SSE2 instructions.
The results were similar (that is, for the more complex data sets,
perfomance got worse).

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Next message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Previous message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 09:31:00 CDT