Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (sam@samason.me.uk)
Date: Tue Apr 28 2009 - 06:21:30 CDT

Next message: Andreas Prilop: "Re: Bidi demo"

Previous message: Mark Davis: "Bidi demo"
In reply to: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Philippe Verdy: "RE: UTF-8 based DFAs and Regexps from Unicode sets"

On Mon, Apr 27, 2009 at 10:24:03AM -0700, Asmus Freytag wrote:
> On 4/27/2009 5:09 AM, John (Eljay) Love-Jensen wrote:
> >Rather than specifying the optimized regular expression in native UTF-8 in
> >the first place, and perhaps another in UTF-16BE, and perhaps another in
> >yada yada...
> >
> >That would avoid the brittleness issue raised by others.
>
> That's a good point and a bit orthogonal to what I was trying to
> highlight. My focus was on calling attention to the fact that multi-step
> implementations with separate and independent phases for conversion and
> algorithmic text processing can be cost-prohibitive in high-volume
> (real-time) applications. Such application domains exist and are real
> scenarios, even though they are not the standard case.

High-volume and real-time are not normally compatible: either you
optimise for the common case and ensure that it's fast (and hence
you're good for high-volume), or you optimise to make sure that the
worst-case response time stays below some cut-off (and you can give
real-time guarantees). These are very different constraints and call
for different implementations.

To address your main point: it's reasonable to perform the conversion at
the same time as the processing. Python, say, has generators which
would allow you to arrange for the conversion to be done on the fly;
this idiom can be translated into various other imperative languages.
If intermediate buffers were kept small, performance could be very good,
as all the data would be processed while still in L1 cache. This wouldn't
seem to help the real-time case at all, though.
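
A minimal sketch of the generator idiom, assuming the input is a UTF-8
byte stream and the "processing" is a simple per-character test. The
names decode_chunks and count_matching are hypothetical, chosen only for
illustration; the standard-library incremental decoder is what copes with
multi-byte sequences that straddle a chunk boundary. It is not a tuned
implementation, and the chunk size is an arbitrary assumption.

```python
import codecs
import io


def decode_chunks(stream, chunk_size=4096, encoding="utf-8"):
    """Lazily yield decoded text from a binary stream, one small chunk at a time.

    Only one chunk of bytes and its decoded text are alive at once, so the
    intermediate buffer stays small (chunk_size is an illustrative default).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The decoder holds back any incomplete trailing byte sequence
        # until the next chunk supplies the rest of it.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def count_matching(chunks, predicate):
    """Stand-in for the processing stage: count characters satisfying predicate."""
    return sum(1 for text in chunks for ch in text if predicate(ch))


data = "naïve café, 日本語".encode("utf-8")
# chunk_size=3 deliberately splits multi-byte sequences across reads.
n = count_matching(decode_chunks(io.BytesIO(data), chunk_size=3), str.isalpha)
print(n)
```

Conversion and processing are interleaved here: each chunk is decoded and
consumed before the next is read, so no fully converted copy of the input
ever exists.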

-- 
  Sam  http://samason.me.uk/


This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 06:27:05 CDT