Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Asmus Freytag
Date: Mon Apr 27 2009 - 12:24:03 CDT


    On 4/27/2009 5:09 AM, John (Eljay) Love-Jensen wrote:
    > Hi Asmus,
    >> I respectfully disagree.
    >> For small amounts of data, and for applications that need to handle
    >> multiple data formats/encodings, it makes sense indeed to first convert
    >> into a common format and then implement the algorithm only once.
    >> However, when you need to scan (in real time) large amounts of data
    >> known to be in UTF-8, the conversion costs will kill you. In my
    >> consulting practice I've come across cases where that matters.
    > Wouldn't it be prudent to have the regular expression expressed in Unicode,
    > and then translate that (for performance on the data stream in the data
    > stream's format) into a UTF-8 one, or UTF-16LE, or UTF-16BE, or UTF-32LE, or
    > UTF-32BE as appropriate?
    > Rather than specifying the optimized regular expression in native UTF-8 in
    > the first place, and perhaps another in UTF-16BE, and perhaps another in
    > yada yada...
    > That would avoid the brittleness issue raised by others.
    That's a good point, and somewhat orthogonal to what I was trying to
    highlight. My focus was on calling attention to the fact that multi-step
    implementations with separate, independent phases for conversion and
    algorithmic text processing can be cost-prohibitive in high-volume
    (real-time) applications. Such application domains are real, even
    though they are not the standard case.
    > And have high performance you are looking for -- working on the native data
    > stream without decoding the data stream into UTF-32 (platform native 32-bit)
    > characters.
    > Putting the burden on the regular expression compiler, which would have to
    > be Unicode savvy, and able to optimize the regex into a particular Unicode
    > transformation format.
    That's a straightforward application of the principle that the
    optimizations should be encapsulated. I would certainly not disagree
    with that. It's the way to go in new code.
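    A minimal sketch of the translation step John describes, in Python (my
    example, not from the thread): the character class is written once in
    Unicode terms, then compiled into an equivalent byte pattern that runs
    directly on the UTF-8 stream. The class U+00C0..U+00FF is chosen because
    every codepoint in it shares the lead byte 0xC3, so the byte pattern is a
    single lead byte plus one continuation range; a real Unicode-savvy
    compiler would generalize this across all encoding lengths and split
    ranges at lead-byte boundaries.

```python
import re

# The class [À-ÿ] expressed once in Unicode terms...
unicode_re = re.compile("[\u00C0-\u00FF]")

# ...and its hand-translated UTF-8 byte-level equivalent:
# U+00C0 encodes as C3 80, U+00FF as C3 BF, so the class is
# exactly the byte sequence C3 followed by 80-BF.
utf8_re = re.compile(b"\xc3[\x80-\xbf]")

text = "Crème brûlée"
data = text.encode("utf-8")

# Both engines find the same characters; the byte engine never decodes.
assert [m.group() for m in unicode_re.finditer(text)] == ["è", "û", "é"]
assert [m.group().decode("utf-8")
        for m in utf8_re.finditer(data)] == ["è", "û", "é"]
```

    The point of the exercise: the byte pattern can be fed to any existing
    byte-oriented regex engine, so only the compiler front end needs to be
    Unicode savvy.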

    One of Bjoern's points was about retrofitting existing implementations.
    You always have the choice of modifying them to handle UTF-8
    directly (as yet another byte-oriented encoding) or converting
    them to use UTF-16 or UTF-32 internally. For existing implementations
    the correct choice depends on a large number of variables, including the
    expected lifetime of the application, whether or not the existing code
    base already handles multi-byte encodings, what types of processing are
    done on the data and how much, which external components need to be
    interfaced with and in what encoding forms, how localized text
    handling is in the architecture, etc., etc.

    Having UTF-8 direct implementations of core algorithms (like character
    classification, regex, etc.) at your command allows you to fine-tune the
    retrofit. You may think that little code is left that needs to be
    retrofitted, but in my consulting practice I keep coming across fresh
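    To make the idea of a "UTF-8 direct" core algorithm concrete, here is a
    hedged sketch (my illustration; the names build_byte_trie and match_utf8
    are hypothetical): a small character class is compiled into a trie over
    the UTF-8 byte sequences of its members, and classification then walks
    the input a byte at a time with no decode step. UTF-8 is prefix-free, so
    reaching an accepting node identifies a class member unambiguously.

```python
def build_byte_trie(codepoints):
    """Build a nested-dict trie over the UTF-8 encodings of the given
    codepoints; the key None marks an accepting state."""
    root = {}
    for cp in codepoints:
        node = root
        for b in chr(cp).encode("utf-8"):
            node = node.setdefault(b, {})
        node[None] = True  # accepting marker
    return root

def match_utf8(trie, data, pos=0):
    """Return the byte length of a class member starting at data[pos],
    or -1 if no member starts there. Operates on raw bytes only."""
    node, length = trie, 0
    while True:
        if None in node:          # reached an accepting state
            return length
        if pos + length >= len(data):
            return -1             # ran out of input mid-sequence
        node = node.get(data[pos + length])
        if node is None:
            return -1             # byte not on any path in the class
        length += 1

# Class [äöü] — all multi-byte in UTF-8 — classified byte-by-byte:
trie = build_byte_trie([0xE4, 0xF6, 0xFC])
data = "für".encode("utf-8")
assert match_utf8(trie, data, 1) == 2   # 'ü' matches, two bytes long
assert match_utf8(trie, data, 0) == -1  # 'f' is not in the class
```

    A dictionary trie is only for legibility; a production version would
    flatten this into DFA transition tables, which is exactly the kind of
    structure an existing byte-oriented engine already executes.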


    This archive was generated by hypermail 2.1.5 : Mon Apr 27 2009 - 12:27:53 CDT