Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Doug Ewell (doug@ewellic.org)
Date: Tue Apr 28 2009 - 22:29:15 CDT

Next message: Mark Davis: "Re: [bidi] Bidi demo"

Previous message: Bjoern Hoehrmann: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Asmus Freytag: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

> For UTF-8 there are many tasks where conversion can be entirely
> avoided, or can be avoided for a large percentage of the data. In
> reading the Unihan Database, for example, each of the more than 1
> million lines contains two ASCII-only fields: the character code, e.g.
> "U+4E00", and the tag name, e.g. "kRSUnicode". Only the third field will
> contain unrestricted UTF-8 (depending on the tag).
>
> About 1/2 of the 28MB file therefore can be read as ASCII. Any
> conversion is wasted effort, and performance gains became visible the
> minute my tokenizer was retargeted to collect tokens in UTF-8.

I haven't benchmarked it, but I would have thought reading ASCII as
UTF-8 would be pretty efficient. Maybe I missed something.
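
As a rough illustration of the approach Asmus describes, here is a minimal
Python sketch that splits a Unihan-style line on tab bytes, keeps the two
ASCII-only fields as raw bytes, and decodes only the third field. The
function name `split_unihan_line` and the sample line are assumptions made
for this example; this is not his tokenizer, and it makes no claim about
how much time is actually saved.

```python
def split_unihan_line(line: bytes):
    """Split one tab-separated line into (code, tag, value).

    Assumption for this sketch: the first two fields are ASCII-only,
    so they are returned as bytes without any conversion. Only the
    third field is decoded from UTF-8.
    """
    code, tag, value = line.rstrip(b"\r\n").split(b"\t", 2)
    return code, tag, value.decode("utf-8")


# Hypothetical sample line, encoded the way it would sit on disk.
sample = "U+4E00\tkMandarin\tyī\n".encode("utf-8")
code, tag, value = split_unihan_line(sample)
print(code, tag, value)
```

The split happens on bytes, which is safe because a tab byte (0x09) never
occurs inside a multi-byte UTF-8 sequence.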

> The point is, there are occasional scenarios where close attention to
> the cost of data conversion pays off. Piecemeal conversion (one line
> at a time) definitely is too coarse, and if you wrap it into a
> "getline" type API, that adds even more overhead. So, that's
> recommended only where text throughput is not critical.

Now I know I've missed something, because I definitely would not have
expected that translating UTF-8 bytes into Unicode code points would add
noticeable overhead to a "getline" function that reads data from
storage.
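
For reference, the translation step in question can be sketched as
follows: a small hand-written UTF-8 decoder with a one-comparison fast
path for ASCII bytes. The name `decode_utf8` is hypothetical, and the code
is only meant to show how much work each byte requires, not to stand in
for any particular library's decoder or to settle the performance
question.

```python
def decode_utf8(data: bytes) -> list:
    """Translate UTF-8 bytes into a list of Unicode code points."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII fast path: the byte is the code point.
            out.append(b)
            i += 1
            continue
        # The lead byte determines how many continuation bytes follow.
        if 0xC2 <= b <= 0xDF:
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + need >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, need + 1):
            c = data[i + k]
            if (c & 0xC0) != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong forms, surrogates, and values above U+10FFFF.
        if cp < lowest or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("ill-formed sequence at offset %d" % i)
        out.append(cp)
        i += need + 1
    return out


print([hex(cp) for cp in decode_utf8("U+4E00\t一".encode("utf-8"))])
```

For a pure-ASCII field, every byte takes the first branch, so the cost per
byte is one comparison plus an append.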

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages


This archive was generated by hypermail 2.1.5 : Tue Apr 28 2009 - 22:33:09 CDT