Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (sam@samason.me.uk)
Date: Sun Apr 26 2009 - 15:18:57 CDT

Next message: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Mark Davis: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Mark Davis: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

On Sun, Apr 26, 2009 at 10:57:49AM -0700, Mark Davis wrote:
> I'd disagree about that. It is certainly simpler to always process as code
> points, but where performance is required, you need to design your
> algorithms with the encoding of the core string representation in mind,
> typically UTF-8 or UTF-16. You can get huge speedups in that way.

Are there any pointers to literature on that? I'd be interested
to see how this sort of scheme hangs together; there would seem
to be quite a trade-off between instruction cache pressure, branch
prediction, and most probably other effects I can't think of at the
moment. Correctness also becomes much harder to demonstrate, so
this sort of thing would only be reasonable for very specialised
libraries, which does seem to be what the OP was about.
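
To make the comparison concrete, here is a rough sketch of the two
approaches for one small test: "is every character in this UTF-8 buffer
in U+0370..U+03FF?". The choice of range, the function names and the
state names are all assumptions made up for illustration; nothing here
comes from an existing library. The first function decodes to code
points, the second runs a hand-built DFA directly over the bytes. It is
in Python, so it shows the structure of the idea only and says nothing
about the speedups in question.

```python
# Illustrative sketch only: names and the chosen range are hypothetical.

def all_in_range_by_code_point(data: bytes) -> bool:
    """Decode first, then test each code point against U+0370..U+03FF."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return all(0x0370 <= ord(ch) <= 0x03FF for ch in text)


# DFA states for the byte-level version.
START, AFTER_CD, AFTER_CE_CF, REJECT = range(4)


def step(state: int, byte: int) -> int:
    """One DFA transition over a single UTF-8 byte.

    U+0370..U+03FF encodes as:
      CD B0..BF   (U+0370..U+037F)
      CE 80..BF   (U+0380..U+03BF)
      CF 80..BF   (U+03C0..U+03FF)
    """
    if state == START:
        if byte == 0xCD:
            return AFTER_CD
        if byte in (0xCE, 0xCF):
            return AFTER_CE_CF
        return REJECT
    if state == AFTER_CD:
        return START if 0xB0 <= byte <= 0xBF else REJECT
    if state == AFTER_CE_CF:
        return START if 0x80 <= byte <= 0xBF else REJECT
    return REJECT


def all_in_range_by_bytes(data: bytes) -> bool:
    """Run the DFA over the raw bytes without decoding to code points."""
    state = START
    for byte in data:
        state = step(state, byte)
        if state == REJECT:
            return False
    # Ending anywhere but START means a truncated sequence.
    return state == START
```

Even in this tiny case the byte-level version needs a hand-derived
table of lead and trail byte ranges, which is where the correctness
worry comes from: the two functions have to agree on every input,
including truncated and malformed sequences.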

-- 
  Sam  http://samason.me.uk/


This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 15:23:14 CDT