Re: UTF-8 based DFAs and Regexps from Unicode sets

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Sun Apr 26 2009 - 12:57:49 CDT

    I'd disagree about that. It is certainly simpler to always process text as
    code points, but where performance matters, you need to design your
    algorithms with the encoding of the core string representation in mind,
    typically UTF-8 or UTF-16. You can get huge speedups that way.

    Take, for example, character conversion. When you are converting from UTF-8
    to another encoding, you can pick up a chunk of 4 bytes at a time, and if
    (chunk & 0x80808080) is zero [a very common case], then all four bytes are
    ASCII and you can do the fast lookup for those 4 bytes, without any need
    for special handling.
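
The 4-byte check can be sketched as follows. This is only an illustration of the logic, not code from any particular converter: the function names (is_ascii_chunk, utf8_to_utf16le) and the choice of UTF-16LE as the target encoding are assumptions made for the example, and Python is used for readability rather than speed. The real gain comes when the same test is done on a machine word in a language such as C.

```python
HIGH_BITS = 0x80808080  # the top bit of each of four bytes


def is_ascii_chunk(chunk: bytes) -> bool:
    """Return True if none of the 4 bytes has its high bit set."""
    word = int.from_bytes(chunk, "little")
    return (word & HIGH_BITS) == 0


def utf8_to_utf16le(data: bytes) -> bytes:
    """Convert UTF-8 to UTF-16LE, taking a fast path for 4 ASCII bytes.

    Hypothetical helper for illustration; the slow path leans on
    Python's own decoder instead of a hand-written one.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        if i + 4 <= n and is_ascii_chunk(data[i:i + 4]):
            # Fast path: each ASCII byte becomes one UTF-16 code unit
            # (the byte itself followed by a zero byte in little-endian).
            for b in data[i:i + 4]:
                out += bytes((b, 0))
            i += 4
            continue

        # Slow path: work out the sequence length from the lead byte
        # and convert exactly one character.
        lead = data[i]
        if lead < 0x80:
            length = 1
        elif lead >> 5 == 0b110:
            length = 2
        elif lead >> 4 == 0b1110:
            length = 3
        elif lead >> 3 == 0b11110:
            length = 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        out += data[i:i + length].decode("utf-8").encode("utf-16-le")
        i += length
    return bytes(out)
```

Because both paths always advance by whole characters, the index never lands in the middle of a multi-byte sequence, so the chunk test is safe to apply at every step.
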
    Mark

    On Sun, Apr 26, 2009 at 08:40, Doug Ewell <doug@ewellic.org> wrote:

    > From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
    >
    >> Now, if we replace each character by its UTF-8 encoding, we would
    >> obtain a regular expression and corresponding automata that match the
    >> same language, but would operate directly on bytes:
    >>
    >> /(A|B|...|a|b|...|\xC3\x80|...)(...)/
    >>
    >
    > I know this isn't the answer you're looking for, but it almost always makes
    > more sense to decode UTF-8 code units into Unicode code points FIRST and
    > then apply other algorithms to operate on Unicode text, instead of trying to
    > build UTF-8 decoding into every algorithm.
    >
    > --
    > Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
    > http://www.ewellic.org
    > http://www1.ietf.org/html.charters/ltru-charter.html
    > http://www.alvestrand.no/mailman/listinfo/ietf-languages


