Re: Unicode forms for internal storage - BOCU-1 speed

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Jan 22 2004 - 12:42:55 EST


    Doug Ewell wrote:
    > BOCU-1 might solve this problem, but multiplying and dividing by 243
    > doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by the
    > claim in UTN #6 that converting Hindi text between UTF-16 and BOCU-1
    > took only 45% as long as converting it between UTF-16 and UTF-8.)

    "claim"? That hurts...

    I did measure these things, and the numbers in the table are all from my measurements. I also
    included the type of machine I used, etc. (http://www.unicode.org/notes/tn6/#Performance)

    The reason why BOCU-1 (and SCSU) is often faster than UTF-8 is that BOCU-1 goes into single-byte
    mode for small scripts like Hindi. Single-byte mode only performs a subtraction, no div/mod or even
    bit-shifting, and writes/reads only one byte per character. It is also optimized in ICU with a tight
    inner loop.
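
    A minimal sketch of the single-byte idea described above, not the actual BOCU-1 or ICU
    code. The function name, the constants (0x90 as the middle byte, a reach of 64, a
    starting state of 0x40) and the use of None for "does not fit in one byte" are
    illustrative assumptions. The real encoder emits multi-byte sequences in that case
    and handles a number of special cases that are omitted here.

```python
def toy_single_byte_deltas(text, middle=0x90, reach=64):
    """Toy model of difference coding within a small script.

    Returns one byte value per character when the difference from the
    previous state fits in a single byte, or None when it does not.
    All constants are assumptions chosen for illustration.
    """
    prev = 0x40  # assumed initial state
    out = []
    for ch in text:
        c = ord(ch)
        delta = c - prev  # the only arithmetic on the fast path
        if -reach <= delta < reach:
            out.append(middle + delta)  # one byte for this character
        else:
            out.append(None)  # a real encoder would switch to multiple bytes
        # Keep the state in the middle of the current 128-code-point block,
        # so later characters from the same script stay within reach.
        prev = (c & ~0x7F) + 0x40
    return out


# Six Devanagari code points (a short Hindi word).
encoded = toy_single_byte_deltas("\u0928\u092e\u0938\u094d\u0924\u0947")
```

    In this toy model only the first character of the Hindi sample falls outside the
    single-byte range. Every following character costs one subtraction and one byte.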

    UTF-8 on the other hand encodes Hindi with 3 bytes per character and has to perform the bit-shifting
    and write to/read from more memory locations.
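
    For comparison, the bit-shifting that UTF-8 needs for a three-byte character can be
    sketched as follows. The helper name is hypothetical; the result is checked against
    Python's built-in encoder.

```python
def utf8_encode_3byte(cp):
    """Encode one code point in U+0800..U+FFFF (excluding surrogates) as UTF-8."""
    if not 0x800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point does not use the three-byte form")
    return bytes([
        0xE0 | (cp >> 12),          # lead byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # first trail byte: middle 6 bits
        0x80 | (cp & 0x3F),         # second trail byte: low 6 bits
    ])


# U+0928 DEVANAGARI LETTER NA: two shifts, two masks, three ORs, three bytes written.
na = utf8_encode_3byte(0x0928)
assert na == "\u0928".encode("utf-8")
```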

    It's the same for Greek/Russian/Arabic etc., although to a lesser degree because it's single bytes
    with BOCU-1 vs. only 2 bytes per character with UTF-8.
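
    The UTF-8 byte counts mentioned above are easy to confirm. The sample letters below
    are arbitrary picks, one per script.

```python
# One sample letter per script, written as escapes to avoid encoding issues.
samples = {
    "Hindi (Devanagari)": "\u0928",  # DEVANAGARI LETTER NA
    "Greek": "\u03b1",               # GREEK SMALL LETTER ALPHA
    "Russian (Cyrillic)": "\u0436",  # CYRILLIC SMALL LETTER ZHE
    "Arabic": "\u0628",              # ARABIC LETTER BEH
}

# Number of bytes each letter occupies in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in utf8_lengths.items():
    print(f"{name}: {n} bytes per character in UTF-8")
```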

    The fact that BOCU-1 not only achieves good compression (and binary order and MIME text/
    compatibility) but also reasonable conversion performance encouraged Mark and me to publish it.

    UTF-8 is useful because it's simple and supported just about everywhere, but it's otherwise hardly
    optimal for anything.

    If you want high-speed, compact encoding, use SCSU. If you want good speed, compact encoding, and
    binary order and/or MIME compatibility, use BOCU-1. Make sure that both sides of the wire know
    what's going across.

    markus


