Re: [unicode] UTF-c

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Feb 21 2011 - 15:49:23 CST


    * Philippe Verdy wrote:
    >And anyway it is also much simpler to understand and easier to
    >implement correctly (not like the sample code given here) than SCSU,
    >and it is still very highly compressible with standard compression
    >algorithms while still allowing very fast processing in memory in its
    >decompressed encoded form :
    >- a bit faster than UTF-8, as seen in my early benchmarks, for small
    >number of large texts such as pages in a Wiki database,
    >- but a bit slower for large number of small strings such as tabular
    >data, because of the higher number of conditional branches when using
    >a CPU with a 1-way instruction pipeline (not a problem with today's
    >processors that include a dozen of parallel pipelines even in a single
    >core, if the compiled assembly code is correctly optimized and
    >scheduled to make use of them when branch-prediction cannot help
    >much).

    From a very brief look, it seems to me that you can eliminate much of
    the conditional logic there, at least as far as decoding goes, in the
    same way I removed it from the UTF-8 decoder described at
    http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ (there you could
    eliminate branches completely, but as I recall it would cost you a
    register, among other things). The main performance problem I
    encountered when developing that decoder was actually compilers being
    silly...
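
To make the idea concrete, here is a minimal sketch of a table-driven UTF-8 decoder in C. It is not the code published at the URL above and it is not UTF-c; the byte classes, state numbering, table layout and all names (utf8_class, utf8_next, utf8_mask, utf8_keep, utf8_step, utf8_decode) are assumptions made up for this illustration. The point is only that the per-byte work is a couple of table lookups with no if statements. Whether a given compiler turns that into branch-free machine code is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UTF8_ACCEPT = 0, UTF8_REJECT = 1 };

/* Byte value -> class (hypothetical numbering): 0 ASCII, 1 80..8F, 2 90..9F,
   3 A0..BF, 4 C2..DF, 5 E0, 6 E1..EC and EE..EF, 7 ED, 8 F0, 9 F1..F3,
   10 F4, 11 never valid (C0, C1, F5..FF). */
static const uint8_t utf8_class[] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,  3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    11,11,4,4,4,4,4,4,4,4,4,4,4,4,4,4,  4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    5,6,6,6,6,6,6,6,6,6,6,6,6,7,6,6,  8,9,9,9,10,11,11,11,11,11,11,11,11,11,11,11,
};
static_assert(sizeof utf8_class == 256, "one class per byte value");

/* utf8_next[state][class] -> next state. REJECT is sticky. */
static const uint8_t utf8_next[9][12] = {
    /* 0 accept          */ {0,1,1,1,2,4,3,5,7,6,8,1},
    /* 1 reject          */ {1,1,1,1,1,1,1,1,1,1,1,1},
    /* 2 one more byte   */ {1,0,0,0,1,1,1,1,1,1,1,1},
    /* 3 two more bytes  */ {1,2,2,2,1,1,1,1,1,1,1,1},
    /* 4 after E0        */ {1,1,1,2,1,1,1,1,1,1,1,1},
    /* 5 after ED        */ {1,2,2,1,1,1,1,1,1,1,1,1},
    /* 6 three more bytes*/ {1,3,3,3,1,1,1,1,1,1,1,1},
    /* 7 after F0        */ {1,1,3,3,1,1,1,1,1,1,1,1},
    /* 8 after F4        */ {1,3,1,1,1,1,1,1,1,1,1,1},
};

/* Payload bits contributed by a byte of each class. */
static const uint8_t utf8_mask[12] = {
    0x7F, 0x3F, 0x3F, 0x3F, 0x1F, 0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07, 0x00,
};

/* All ones for continuation classes (keep the bits gathered so far),
   zero for every other class (start a new code point). */
static const uint32_t utf8_keep[12] = {
    0, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0, 0, 0, 0, 0, 0, 0, 0,
};

/* Feed one byte: update the code point accumulator, return the next state. */
static uint32_t utf8_step(uint32_t state, uint32_t *cp, uint8_t byte)
{
    uint32_t cls = utf8_class[byte];
    *cp = ((*cp << 6) & utf8_keep[cls]) | (uint32_t)(byte & utf8_mask[cls]);
    return utf8_next[state][cls];
}

/* Decode n bytes into out, which must have room for n entries.
   Returns the number of code points, or -1 if the input is malformed
   or ends in the middle of a sequence. */
static long utf8_decode(const uint8_t *s, size_t n, uint32_t *out)
{
    uint32_t state = UTF8_ACCEPT, cp = 0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        state = utf8_step(state, &cp, s[i]);
        out[count] = cp;                  /* unconditional store */
        count += (state == UTF8_ACCEPT);  /* advance only when a code point is complete */
    }
    return state == UTF8_ACCEPT ? (long)count : -1;
}
```

In this sketch the restrictions that usually show up as comparisons (overlong forms, surrogates, values above U+10FFFF) live in the transition table instead, so the loop body stays the same for every byte. A caller would pass a byte buffer and an output array of the same length, for example utf8_decode(buf, len, out), and treat a return value of -1 as malformed input.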

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    

