Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Doug Ewell (doug@ewellic.org)
Date: Tue Apr 28 2009 - 22:29:15 CDT

    Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

    > For UTF-8 there are many tasks where conversion can be entirely
    > avoided, or can be avoided for a large percentage of the data. In
    > reading the Unihan Database, for example, each of the > 1 million
    > lines contains two ASCII-only fields. The character code, e.g.
    > "U+4E00" and the tag name e.g. "kRSUnicode". Only the third field will
    > contain unrestricted UTF-8 (depending on the tag).
    >
    > About 1/2 of the 28MB file therefore can be read as ASCII. Any
    > conversion is wasted effort, and performance gains became visible the
    > minute my tokenizer was retargeted to collect tokens in UTF-8.

    I haven't benchmarked it, but I would have thought reading ASCII as
    UTF-8 would be pretty efficient. Maybe I missed something.
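
    To make that expectation concrete, here is a minimal sketch of a UTF-8
    decoder with an ASCII fast path. It is not taken from any tokenizer
    discussed in this thread; the function name decode_utf8 and its error
    handling are assumptions made for illustration only.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of code points.

    Hypothetical helper for illustration; raises ValueError on
    ill-formed input.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        # Fast path: an ASCII byte is its own code point, so the only
        # work is one comparison and one copy.
        if b < 0x80:
            out.append(b)
            i += 1
            continue
        # Slow path: the lead byte gives the number of continuation
        # bytes and the smallest code point allowed at that length.
        if 0xC2 <= b <= 0xDF:
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")
        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):
            c = data[i + k]
            if (c & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong forms, surrogates, and values past U+10FFFF.
        if cp < lowest or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"ill-formed sequence at offset {i}")
        out.append(cp)
        i += need + 1
    return out
```

    On an ASCII-only field such as "U+4E00" or "kRSUnicode", every byte
    takes the first branch, so the per-byte cost is a single test. Whether
    that cost is noticeable next to reading the data from storage is the
    open question here, and this sketch does not measure it.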

    > The point is, there are occasional scenarios where close attention to
    > the cost of data conversion pays off. Piecemeal conversion (one line
    > at a time) definitely is too coarse, and if you wrap it into a
    > "getline" type API, that adds even more overhead. So, that's
    > recommended only where text throughput is not critical.

    Now I know I've missed something, because I definitely would not have
    expected that translating UTF-8 bytes into Unicode code points would add
    noticeable overhead to a "getline" function that reads data from
    storage.

    --
    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    


    This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 22:33:09 CDT