Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Sam Mason (
Date: Tue Apr 28 2009 - 06:21:30 CDT


    On Mon, Apr 27, 2009 at 10:24:03AM -0700, Asmus Freytag wrote:
    > On 4/27/2009 5:09 AM, John (Eljay) Love-Jensen wrote:
    > >Rather than specifying the optimized regular expression in native UTF-8 in
    > >the first place, and perhaps another in UTF-16BE, and perhaps another in
    > >yada yada...
    > >
    > >That would avoid the brittleness issue raised by others.
    > That's a good point and a bit orthogonal to what I was trying to
    > highlight. My focus was on calling attention to the fact that multi-step
    > implementations with separate and independent phases for conversion and
    > algorithmic text processing can be cost-prohibitive in high-volume
    > (real-time) applications. Such application domains exist and are real
    > scenarios, even though they are not the standard case.

    High-volume and real-time are not normally compatible: either you
    optimise for the common case and ensure that it's fast (and hence
    you're good for high-volume), or you optimise to make sure that the
    worst-case response time stays below some cut-off (and you can give
    real-time guarantees). These are very different constraints and call
    for different implementations.

    To address your main point: it's reasonable to perform the conversion
    at the same time as the processing is done. Python, say, has generators
    which would allow you to arrange for the conversion to be done on the
    fly, and this idiom can be translated into various other imperative
    languages. If intermediate buffers were kept small, performance could
    be very good, as the data being worked on would stay in L1 cache. This
    wouldn't seem to help the real-time case at all, though.
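
    As a rough sketch of the generator idiom (not a tuned implementation),
    something like the following decodes UTF-8 a small buffer at a time and
    hands the text straight to the processing step. The names decode_chunks
    and count_matching, and the 4096-byte buffer size, are made up for
    illustration; the only library piece relied on is the standard
    incremental decoder from the codecs module.

```python
import codecs
import io


def decode_chunks(stream, encoding="utf-8", bufsize=4096):
    """Yield decoded text from a binary stream, one small buffer at a time.

    The incremental decoder holds back an incomplete multi-byte sequence
    at the end of a buffer and completes it with the next read.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the input ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def count_matching(chunks, predicate):
    """Stand-in for the 'processing' phase: count characters that match."""
    return sum(1 for text in chunks for ch in text if predicate(ch))


if __name__ == "__main__":
    data = "naïve café ☃".encode("utf-8")
    # A tiny buffer forces multi-byte sequences to straddle reads.
    chunks = decode_chunks(io.BytesIO(data), bufsize=3)
    print(count_matching(chunks, lambda ch: ord(ch) > 127))
```

    Conversion and processing are interleaved here, so only one buffer's
    worth of bytes and its decoded text are live at any time. Whether that
    actually keeps everything in L1 cache depends on the buffer size and
    the machine, and nothing in this bounds the worst-case latency.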

