Flexible and Economical UTF-8 Decoder

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Apr 13 2009 - 00:14:15 CDT


      I've written a simple UTF-8 decoding function that processes a single
    byte at a time while the caller maintains the decoder state. As such it
    is much easier to use correctly in many situations. What makes this
    feasible is that the function has only about a dozen instructions, so it
    is easily inlined. For the work-in-progress implementation and
    documentation see:


    Essentially it uses a specially constructed table-driven DFA for state
    transitions, so the decoding function just does table lookups and the
    usual bit magic. To verify that this approach is sound, I have timed a
    simple transcoder against some popular UTF-8 to UTF-16 transcoders.
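
    To make the idea concrete, here is a minimal sketch of a byte-at-a-time,
    table-driven UTF-8 DFA in C. It is an illustration only and not the
    implementation referred to above: the class numbering, the table layout,
    and the names byte_class, payload_mask, next_state, utf8_step and
    utf8_count are all assumptions made for this example. The caller owns
    the state and the partially assembled code point, and each step is two
    table lookups plus a shift and a mask.

```c
#include <stddef.h>
#include <stdint.h>

enum { UTF8_ACCEPT = 0, UTF8_REJECT = 1 };

/* Byte classes (numbering is an assumption made for this sketch):
 *  0: 00..7F   1: 80..8F   2: 90..9F   3: A0..BF   4: C2..DF   5: E0
 *  6: E1..EC and EE..EF    7: ED       8: F0       9: F1..F3  10: F4
 * 11: bytes that are never valid (C0, C1, F5..FF) */
static const uint8_t byte_class[256] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 00..0F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 10..1F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 20..2F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 30..3F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 40..4F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 50..5F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 60..6F */
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 70..7F */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 80..8F */
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* 90..9F */
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* A0..AF */
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* B0..BF */
    11, 11, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, /* C0..CF */
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, /* D0..DF */
    5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 6, /* E0..EF */
    8, 9, 9, 9, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11 /* F0..FF */
};

/* Bits of a byte that carry code point data, indexed by byte class. */
static const uint8_t payload_mask[12] = {
    0x7F, 0x3F, 0x3F, 0x3F, 0x1F, 0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07, 0x00
};

/* next_state[state][class]. States: 0 accept, 1 reject, 2/3/4 expect
 * one/two/three more continuation bytes, 5 after E0, 6 after ED,
 * 7 after F0, 8 after F4. States 5..8 restrict the range of the next
 * byte to exclude overlong forms, surrogates and values above U+10FFFF. */
static const uint8_t next_state[9][12] = {
    /* 0 accept */ {0, 1, 1, 1, 2, 5, 3, 6, 7, 4, 8, 1},
    /* 1 reject */ {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 2 need 1 */ {1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 3 need 2 */ {1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 4 need 3 */ {1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 5 E0     */ {1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 6 ED     */ {1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 7 F0     */ {1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 8 F4     */ {1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
};

/* Feed one byte. *state and *codep belong to the caller. When the
 * result is UTF8_ACCEPT, *codep holds a complete code point. Once the
 * result is UTF8_REJECT the caller must reset *state to continue. */
static uint32_t utf8_step(uint32_t *state, uint32_t *codep, uint8_t byte)
{
    uint8_t cls = byte_class[byte];
    *codep = (*state == UTF8_ACCEPT ? 0u : *codep << 6)
           | (uint32_t)(byte & payload_mask[cls]);
    *state = next_state[*state][cls];
    return *state;
}

/* Example caller: count code points, or return -1 if the input is
 * malformed or ends in the middle of a sequence. */
static long utf8_count(const char *s, size_t len)
{
    uint32_t state = UTF8_ACCEPT, codep = 0;
    long count = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t st = utf8_step(&state, &codep, (uint8_t)s[i]);
        if (st == UTF8_REJECT)
            return -1;
        if (st == UTF8_ACCEPT)
            count++;
    }
    return state == UTF8_ACCEPT ? count : -1;
}
```

    Because the state lives with the caller, input can arrive in chunks of
    any size, and a chunk boundary in the middle of a sequence needs no
    special handling. A transcoder would call utf8_step in the same loop and
    emit *codep whenever the state returns to UTF8_ACCEPT.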

    Results are somewhat compiler- and, I imagine, architecture-specific,
    but my implementation appears to compare well given its simplicity.
    See the web page for results on my system.


    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
