Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (
Date: Sun Apr 26 2009 - 15:18:57 CDT

    On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
    > I'd disagree about that. It is certainly simpler to always process as code
    > points, but where performance is required, you need to design your
    > algorithms with the encoding of the core string representation in mind,
    > typically UTF-8 or UTF-16. You can get huge speedups in that way.

    Are there any pointers to literature about that? I'd be interested
    to see how this sort of scheme would hang together; there would seem
    to be quite a trade-off between instruction cache pressure, branch
    prediction and probably other effects I can't think of at the
    moment. Correctness would also seem to become much harder to
    demonstrate, so this sort of thing would only be reasonable for very
    specialised libraries, which does seem to be what the OP was about.
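
    To make the trade-off concrete, here is a minimal sketch of the two
    approaches for one hypothetical set, the Cyrillic block U+0400..U+04FF.
    It is not taken from the quoted message or from any particular library;
    the function names and the choice of set are assumptions made for
    illustration. In UTF-8 that block is exactly the two-byte sequences with
    a lead byte in D0..D3 and a trailing byte in 80..BF, so a two-state DFA
    over bytes can test membership without ever rebuilding a code point.

```python
# Assumed example set: the Cyrillic block U+0400..U+04FF.
# UTF-8 encodes it as lead byte 0xD0..0xD3 followed by one byte 0x80..0xBF.


def all_cyrillic_codepoints(data: bytes) -> bool:
    """Baseline: decode to code points first, then test each one."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # ill-formed UTF-8 is treated as a non-match
    return all(0x0400 <= ord(ch) <= 0x04FF for ch in text)


# States of the byte-level DFA (names are illustrative only).
START, NEED_TAIL, REJECT = 0, 1, 2


def all_cyrillic_bytes(data: bytes) -> bool:
    """Encoding-aware version: walk the raw bytes through a small DFA."""
    state = START
    for b in data:
        if state == START:
            # Only the four lead bytes D0..D3 can begin a member of the set.
            state = NEED_TAIL if 0xD0 <= b <= 0xD3 else REJECT
        else:  # state == NEED_TAIL
            # Any continuation byte completes a code point in the block.
            state = START if 0x80 <= b <= 0xBF else REJECT
        if state == REJECT:
            return False
    # Ending in NEED_TAIL means the input was truncated mid-character.
    return state == START


def agree_on_single_codepoints(limit: int = 0x3000) -> bool:
    """Check the DFA against the baseline for every code point below limit."""
    for cp in range(limit):
        encoded = chr(cp).encode("utf-8")
        if all_cyrillic_bytes(encoded) != all_cyrillic_codepoints(encoded):
            return False
    return True
```

    Python will not show the speedup being discussed; the point is only the
    shape of the code. The byte-level version depends on the encoding of the
    set being worked out correctly by hand or by a generator, which is where
    the correctness worry comes in. A check such as
    agree_on_single_codepoints() against the code point baseline is one
    hedge against that, though it only covers single characters.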

