From: Mark Davis (mark.edward.davis@gmail.com)
Date: Sun Apr 26 2009 - 12:57:49 CDT
I'd disagree with that. It is certainly simpler to always process as code
points, but where performance is required, you need to design your
algorithms with the encoding of the core string representation in mind,
typically UTF-8 or UTF-16. You can get huge speedups that way.
Take, for example, character conversion. When you are converting from UTF-8
to another encoding, you can pick up a chunk of 4 bytes at a time, and if
(chunk & 0x80808080) is zero [a very common case], then you can do the fast
lookup for those 4 bytes, without any need for special handling.
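A minimal sketch of the mask test Mark describes, in C. The helper names
are illustrative, and the example only shows the ASCII fast path itself,
not a full converter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if all four bytes starting at p have the high bit clear,
   i.e. the whole chunk is ASCII and needs no multi-byte handling. */
static int chunk_is_ascii(const unsigned char *p) {
    uint32_t chunk;
    memcpy(&chunk, p, 4);              /* safe unaligned 4-byte load */
    return (chunk & 0x80808080u) == 0; /* the test from the email */
}

/* Length of the leading all-ASCII run, scanning 4 bytes at a time on
   the fast path and falling back to byte-at-a-time at the boundary.
   A converter can handle this whole run with a trivial table lookup. */
static size_t ascii_prefix_len(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i + 4 <= n && chunk_is_ascii(s + i))
        i += 4;
    while (i < n && s[i] < 0x80)
        i++;
    return i;
}
```

On mostly-ASCII input the inner loop touches one 32-bit word per four
bytes, which is where the speedup over per-code-point decoding comes from.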
Mark
On Sun, Apr 26, 2009 at 08:40, Doug Ewell <doug@ewellic.org> wrote:
> From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
>
>> Now, if we replace each character by its UTF-8 encoding, we would
>> obtain a regular expression and corresponding automata that match the
>> same language, but would operate directly on bytes:
>>
>> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
>>
>
> I know this isn't the answer you're looking for, but it almost always makes
> more sense to decode UTF-8 code units into Unicode code points FIRST and
> then apply other algorithms to operate on Unicode text, instead of trying to
> build UTF-8 decoding into every algorithm.
>
> --
> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> http://www.ewellic.org
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
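A minimal sketch of the decode-first approach Doug describes: decode UTF-8
code units into code points once, then run algorithms on code points. This
is an illustrative toy decoder, not a conforming one; it does not reject
overlong forms or surrogate code points:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (n bytes available); store the
   code point in *cp and return the number of bytes consumed, 0 if the
   input is malformed or truncated. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp) {
    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }               /* 1 byte  */
    if ((s[0] & 0xE0) == 0xC0 && n >= 2 &&
        (s[1] & 0xC0) == 0x80) {                             /* 2 bytes */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && n >= 3 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {    /* 3 bytes */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && n >= 4 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80 &&
        (s[3] & 0xC0) == 0x80) {                             /* 4 bytes */
        *cp = ((uint32_t)(s[0] & 0x07) << 18) |
              ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0; /* malformed */
}

/* Example of an algorithm that operates on code points, not bytes:
   count the code points in a UTF-8 buffer by decoding first. */
static size_t count_code_points(const unsigned char *s, size_t n) {
    size_t count = 0, i = 0;
    while (i < n) {
        uint32_t cp;
        size_t len = utf8_decode(s + i, n - i, &cp);
        if (len == 0) break; /* stop on malformed input */
        i += len;
        count++;
    }
    return count;
}
```

Every algorithm layered on top of `utf8_decode` sees only code points,
which is the simplicity Doug is arguing for; Mark's point is that the
decode step can be bypassed on hot paths.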
This archive was generated by hypermail 2.1.5 : Sun Apr 26 2009 - 13:02:01 CDT