Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Apr 26 2009 - 15:24:29 CDT


    On 4/26/2009 8:40 AM, Doug Ewell wrote:
    > From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
    >
    >> Now, if we replace each character by its UTF-8 encoding, we would ob-
    >> tain a regular expression and corresponding automata that match the
    >> same language, but would operate directly on bytes:
    >>
    >> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
    >
    > I know this isn't the answer you're looking for, but it almost always
    > makes more sense to decode UTF-8 code units into Unicode code points
    > FIRST and then apply other algorithms to operate on Unicode text,
    > instead of trying to build UTF-8 decoding into every algorithm.
    >
    I respectfully disagree.

    For small amounts of data, and for applications that need to handle
    multiple data formats/encodings, it makes sense indeed to first convert
    into a common format and then implement the algorithm only once.

    However, when you need to scan (in real time) large amounts of data
    known to be in UTF-8, the conversion costs will kill you. In my
    consulting practice I've come across cases where that matters.

    These days, I'm working on an upgrade to Unibook
    (http://unicode.org/unibook) that can read information from the Unihan
    database (>1 million lines of UTF-8). Supporting UTF-8 by conversion
    proved unacceptably slow for use in an interactive environment.

    I investigated a number of optimizations. The big ones included
    reimplementing the tokenizer to work directly on UTF-8, and limiting the
    conversion to data that are later used as strings in formatting and
    display. With that, Unibook can read an un-preprocessed Unihan DB fast
    enough that it can take place in the background during startup.
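
    To illustrate the general idea (this is not Unibook's actual code, just
    a minimal sketch), a tokenizer can split on ASCII delimiters in the raw
    bytes and decode only the values that end up being displayed. This is
    safe because bytes below 0x80 never occur inside a multi-byte UTF-8
    sequence. The function names and the sample data are hypothetical; the
    "U+XXXX<tab>tag<tab>value" line layout is assumed.

```python
def iter_fields(data: bytes):
    """Yield (code_point, tag, value) byte triples without decoding anything."""
    # bytes.splitlines() only breaks on ASCII line endings, so multi-byte
    # UTF-8 sequences are never cut in half.
    for line in data.splitlines():
        # Skip blank lines and comments; b"#" is plain ASCII.
        if not line or line.startswith(b"#"):
            continue
        parts = line.split(b"\t", 2)
        if len(parts) == 3:
            yield parts[0], parts[1], parts[2]


def load_tag(data: bytes, wanted_tag: bytes) -> dict:
    """Collect the raw (still UTF-8 encoded) values for a single tag."""
    table = {}
    for code_point, tag, value in iter_fields(data):
        if tag == wanted_tag and code_point.startswith(b"U+"):
            # int() accepts ASCII hex digits given as bytes.
            table[int(code_point[2:], 16)] = value
    return table


def display(value: bytes) -> str:
    """Decode lazily, only for values that are actually shown."""
    return value.decode("utf-8")


# Hypothetical sample data in the assumed layout.
sample = (
    "# sample data\n"
    "U+4E00\tkDefinition\tone\n"
    "U+4E8C\tkMandarin\t\u00e8r\n"
).encode("utf-8")

readings = load_tag(sample, b"kMandarin")
```

    Only the call to display() pays the conversion cost, so lines that are
    filtered out or never shown are handled purely as bytes.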

    If I understand him correctly, Bjoern also suggests that his method
    offers yet another avenue for Unicode-enabling existing multi-byte aware
    applications. Depending on the circumstances in each case, such a
    retrofit might make sense.
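
    For concreteness, the substitution Bjoern describes (replace each
    character by its UTF-8 encoding and match on bytes) could be sketched
    roughly as follows. This is only an illustration using Python's re
    module on bytes, not his construction; byte_alternation is a
    hypothetical helper, and a real implementation would factor common
    prefixes rather than list every character.

```python
import re


def byte_alternation(chars: str) -> bytes:
    """Build a byte-level alternation matching any one of the given characters."""
    # Each character is replaced by its (escaped) UTF-8 byte sequence.
    alternatives = [re.escape(ch.encode("utf-8")) for ch in sorted(chars)]
    return b"(?:" + b"|".join(alternatives) + b")"


# Matches one or more of A, a, or U+00C0, operating directly on UTF-8 bytes.
pattern = re.compile(byte_alternation("Aa\u00c0") + b"+")

matched = pattern.fullmatch("\u00c0aA".encode("utf-8"))
rejected = pattern.fullmatch("\u00c1a".encode("utf-8"))
```

    Here matched is a match object and rejected is None, since U+00C1
    encodes to the bytes C3 81, which are not in the alternation.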

    Having a larger toolbox is always nice.

    A./


