Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Bjoern Hoehrmann (
Date: Mon Apr 27 2009 - 09:28:00 CDT

  • Next message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    * Sam Mason wrote:
    >On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
    >> I'd disagree about that. It is certainly simpler to always process as code
    >> points, but where performance is required, you need to design your
    >> algorithms with the encoding of the core string representation in mind,
    >> typically UTF-8 or UTF-16. You can get huge speedups in that way.
    >Are there any pointers to literature about that? I'd be interested
    >to see how this sort of scheme would hang together; there would seem
    >to be quite a trade-off between instruction cache pressure, branch
    >prediction and most probably other effects I can't think of at the

    To offer one data point, Python's UTF-8 decoder does this and I've timed
    it against a couple of others. I note that the decoder accepts surrogate
    code points so the comparison is not entirely fair. The times are at:

    Transcoding my name from UTF-8 to UTF-16 is substantially faster due to
    this optimization. That is largely because my processor can predict the
    additional branches almost perfectly (16 byte string, the pattern for
    the check is simply alternating no yes no yes). For the other data sets
    Python's decoder is slower than my own without any such optimization.

    I've also added a similar optimization to my own version, using the
    aligned 4 byte read and aligned 16 byte reads using SSE2 instructions.
    The results were similar (that is, for the more complex data sets,
    perfomance got worse).

    Björn Höhrmann · ·
    Am Badedeich 7 · Telefon: +49(0)160/4415681 ·
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · 

    This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 09:31:00 CDT