Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Sun Apr 26 2009 - 12:57:49 CDT

Next message: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

Previous message: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
In reply to: Doug Ewell: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Next in thread: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"
Reply: Sam Mason: "Re: UTF-8 based DFAs and Regexps from Unicode sets"

I'd disagree with that. It is certainly simpler to always process text as
code points, but where performance matters, you need to design your
algorithms with the encoding of the core string representation in mind,
typically UTF-8 or UTF-16. You can get huge speedups that way.

Take character conversion, for example. When you are converting from UTF-8
to another encoding, you can pick up a chunk of 4 bytes at a time. If
(chunk & 0x80808080) is zero [a very common case], then all 4 bytes are
ASCII, and you can do the fast lookup for those 4 bytes without any need
for special handling.
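
To make the idea concrete, here is a minimal sketch of that fast path. It is
an illustration only, not how any particular converter is written: the name
convert_utf8, the 128-entry table, and the choice of cp037 as the target
encoding are all assumptions made for the example. Python is used just to
show the control flow; the real gain comes from doing the same test on a
machine word in a compiled language.

```python
HIGH_BITS = 0x80808080  # the top bit of each of four bytes


def convert_utf8(data: bytes, target: str = "cp037") -> bytes:
    """Convert UTF-8 bytes to `target`, with a 4-byte ASCII fast path.

    Hypothetical helper for illustration; `target` is assumed to be a
    single-byte encoding that can represent all 128 ASCII characters.
    """
    # Lookup table: ASCII byte value -> corresponding byte in the target.
    table = bytes(range(128)).decode("ascii").encode(target)
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        if i + 4 <= n:
            chunk = int.from_bytes(data[i:i + 4], "little")
            if chunk & HIGH_BITS == 0:
                # Fast path: no high bit set, so all four bytes are ASCII.
                out.extend(table[b] for b in data[i:i + 4])
                i += 4
                continue
        # Slow path: handle one character, whatever its encoded length.
        b = data[i]
        if b < 0x80:
            length = 1
        elif b >> 5 == 0b110:
            length = 2
        elif b >> 4 == 0b1110:
            length = 3
        elif b >> 3 == 0b11110:
            length = 4
        else:
            length = 1  # not a valid lead byte; decode() below will raise
        out += data[i:i + length].decode("utf-8").encode(target)
        i += length
    return out
```

For mostly-ASCII input nearly every iteration takes the fast branch, and the
per-character decoding only runs around the occasional non-ASCII character
or in the last few bytes of the buffer.
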
Mark

On Sun, Apr 26, 2009 at 08:40, Doug Ewell <doug@ewellic.org> wrote:

> From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
>
>> Now, if we replace each character by its UTF-8 encoding, we would obtain
>> a regular expression and corresponding automata that match the same
>> language, but would operate directly on bytes:
>>
>> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
>>
>
> I know this isn't the answer you're looking for, but it almost always makes
> more sense to decode UTF-8 code units into Unicode code points FIRST and
> then apply other algorithms to operate on Unicode text, instead of trying to
> build UTF-8 decoding into every algorithm.
>
> --
> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> http://www.ewellic.org
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
>
>
>
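
For comparison, the two approaches in the quoted text can be put side by
side in a small sketch. The character set here (ASCII letters plus U+00C0
through U+00D6) is an assumption chosen to echo the \xC3\x80 in the quoted
pattern, and the function names are made up for the example. The byte-level
pattern relies on U+00C0..U+00D6 encoding as C3 80 .. C3 96 in UTF-8.

```python
import re

# Decode first, then match on code points (the approach Doug describes).
CODE_POINT_RE = re.compile("[A-Za-z\u00C0-\u00D6]+")

# Match directly on UTF-8 bytes (the approach Bjoern describes): each
# character in the set is replaced by its UTF-8 byte sequence.
BYTE_RE = re.compile(rb"(?:[A-Za-z]|\xC3[\x80-\x96])+")


def matches_decoded(data: bytes) -> bool:
    """Decode the UTF-8 input, then test the code point pattern."""
    return CODE_POINT_RE.fullmatch(data.decode("utf-8")) is not None


def matches_bytes(data: bytes) -> bool:
    """Test the byte-level pattern on the raw UTF-8 input."""
    return BYTE_RE.fullmatch(data) is not None
```

On valid UTF-8 input both functions accept the same strings; the byte-level
version simply skips the decoding step.
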

