Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason
Date: Mon Apr 27 2009 - 09:51:07 CDT

  • Next message: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

    On Mon, Apr 27, 2009 at 04:28:00PM +0200, Bjoern Hoehrmann wrote:
    > * Sam Mason wrote:
    > >On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
    > >> I'd disagree about that. It is certainly simpler to always process as code
    > >> points, but where performance is required, you need to design your
    > >> algorithms with the encoding of the core string representation in mind,
    > >> typically UTF-8 or UTF-16. You can get huge speedups in that way.
    > >
    > >Are there any pointers to literature about that? I'd be interested
    > >to see how this sort of scheme would hang together; there would seem
    > >to be quite a trade-off between instruction cache pressure, branch
    > >prediction and most probably other effects I can't think of at the
    > >moment.
    > To offer one data point, Python's UTF-8 decoder does this and I've timed
    > it against a couple of others. I note that the decoder accepts surrogate
    > code points so the comparison is not entirely fair.

    The point I was responding to was about running regexes directly on the
    data (i.e. UTF-8 when the source is UTF-8, UTF-16 when the source is
    UTF-16, etc.) rather than converting to a standard internal
    encoding first.

    > The times are at:
    > Transcoding my name from UTF-8 to UTF-16 is substantially faster due to
    > this optimization. That is largely because my processor can predict the
    > additional branches almost perfectly (16 byte string, the pattern for
    > the check is simply alternating no yes no yes). For the other data sets
    > Python's decoder is slower than my own without any such optimization.
    > I've also added a similar optimization to my own version, using the
    > aligned 4 byte read and aligned 16 byte reads using SSE2 instructions.
    > The results were similar (that is, for the more complex data sets,
    > performance got worse).

    Have you seen work such as u8u16[1]? It optimises code specifically
    for newer instruction sets (MMX, SSE and AltiVec), resulting in
    much higher performance (about five times faster than your code, and
    roughly 20 times faster than iconv, if I'm reading things correctly).
    They've patented it for commercial use while leaving it free for FOSS
    use, which is fair enough, but a tad annoying.

    It would be nice if their Parallel Bit Streams paper[2] went into
    more detail; I've only given it a quick scan, and didn't see the
    sort of details that interested me anyway!


    This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 09:53:58 CDT