Re: Unicode forms for internal storage - BOCU-1 speed

From: Doug Ewell
Date: Fri Jan 23 2004 - 01:57:15 EST

    Markus Scherer <markus dot scherer at jtcsv dot com> wrote:

    >> BOCU-1 might solve this problem, but multiplying and dividing by 243
    >> doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by
    >> the claim in UTN #6 that converting Hindi text between UTF-16 and
    >> BOCU-1 took only 45% as long as converting it between UTF-16 and
    >> UTF-8.)
    > "claim"? That hurts...
    > I did measure these things, and the numbers in the table are all from
    > my measurements. I also included the type of machine I used, etc.

    Certainly I would never accuse Markus of falsifying these statistics.
    The word "claim" was not meant in the sense of "unsubstantiated claim."

    It did startle me that converting to BOCU-1 and SCSU could be TWICE as
    fast as converting to UTF-8, unless the I/O cost of writing two or three
    bytes is *much* higher than that of writing only one.
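
For reference, the two kinds of arithmetic being compared can be sketched as follows. This is only an illustration: `utf8_encode_bmp` and `base243_digits` are hypothetical helper names, and the base-243 split shows the div/mod step alone, not BOCU-1's actual lead-byte offsets or trail-byte mapping.

```python
def utf8_encode_bmp(cp):
    """Encode one non-surrogate BMP code point to UTF-8 with shifts and masks."""
    if not 0 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("expected a non-surrogate BMP code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def base243_digits(value, count):
    """Split a non-negative integer into `count` base-243 digits.

    Digits are returned most significant first. This is plain div/mod
    arithmetic, not a BOCU-1 byte sequence.
    """
    digits = []
    for _ in range(count):
        value, digit = divmod(value, 243)
        digits.append(digit)
    if value:
        raise ValueError("value does not fit in the requested number of digits")
    return digits[::-1]
```

A Devanagari letter such as U+0928 takes the three-byte branch in UTF-8, so each character costs two shifts, several masks, and three byte writes.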

    > The reason why BOCU-1 (and SCSU) is often faster than UTF-8 is that
    > BOCU-1 goes into single-byte mode for small scripts like Hindi.
    > Single-byte mode only performs a subtraction, no div/mod or even bit-
    > shifting, and writes/reads only one byte per character. It is also
    > optimized in ICU with a tight inner loop.
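
A minimal sketch of the single-byte case described above, not a conformant BOCU-1 encoder and not ICU's code. The constants (middle byte 0x90, single-byte reach of -64..+63) and the rule that moves the base value to the middle of the current 128-character block are assumptions taken from my reading of UTN #6 and should be checked against it. Handling of spaces, control characters, multi-byte differences, and the special cases for large scripts is omitted, and the function names are hypothetical.

```python
MIDDLE = 0x90      # assumed byte value for a difference of zero
REACH_NEG = -64    # assumed lowest difference that fits in one byte
REACH_POS = 63     # assumed highest difference that fits in one byte


def next_prev(cp):
    """Assumed small-script rule: middle of the 128-code-point block."""
    return (cp & ~0x7F) + 0x40


def encode_single_byte_run(text, prev):
    """Encode `text` as one byte per character, starting from state `prev`.

    Raises ValueError when a difference needs more than one byte,
    because this sketch does not implement the multi-byte forms.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        diff = cp - prev
        if not REACH_NEG <= diff <= REACH_POS:
            raise ValueError(f"U+{cp:04X} needs a multi-byte difference")
        out.append(MIDDLE + diff)   # one addition, one byte written
        prev = next_prev(cp)
    return bytes(out)


def decode_single_byte_run(data, prev):
    """Inverse of encode_single_byte_run for the same starting state."""
    chars = []
    for b in data:
        cp = prev + (b - MIDDLE)
        chars.append(chr(cp))
        prev = next_prev(cp)
    return "".join(chars)


hindi = "\u0928\u092E\u0938\u094D\u0924\u0947"   # Devanagari "namaste"
packed = encode_single_byte_run(hindi, prev=0x0940)
```

Under these assumptions the six Devanagari characters occupy six bytes, against eighteen in UTF-8, and the inner loop does no division or shifting at all. The starting state of 0x0940 stands in for an encoder that has already seen one Devanagari character.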

    I'll have to see how my encoder and decoder perform when I finish them.
    They're currently written for simplicity, not speed.

    > UTF-8 is useful because it's simple, and supported just about
    > everywhere - but it's otherwise hardly optimal for anything.

    As John said, it's all about ASCII transparency, together with no false
    positives for "ASCII bytes" in non-Basic Latin characters.
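
The "no false positives" property can be checked directly. The sketch below compares UTF-8 with UTF-16LE for a string holding a real slash and DEVANAGARI LETTER YA (U+092F); `ascii_byte_positions` is a hypothetical helper name.

```python
def ascii_byte_positions(data):
    """Return the indexes of bytes that fall in the ASCII range 0x00..0x7F."""
    return [i for i, b in enumerate(data) if b < 0x80]


text = "a/\u092F"                  # 'a', '/', DEVANAGARI LETTER YA
utf8 = text.encode("utf-8")        # 61 2F E0 A4 AF
utf16 = text.encode("utf-16-le")   # 61 00 2F 00 2F 09

# In UTF-8 only the two genuine ASCII characters produce ASCII bytes.
utf8_ascii = ascii_byte_positions(utf8)

# In UTF-16LE the low byte of U+092F is 0x2F, which looks like a second '/'.
slashes_utf8 = utf8.count(b"/")
slashes_utf16 = utf16.count(b"/")
```

A byte-oriented parser that searches for "/" therefore finds exactly one in the UTF-8 form and two in the UTF-16LE form.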

    > If you want high-speed, compact encoding, use SCSU. If you want good
    > speed, compact encoding, and binary order and/or MIME compatibility,
    > use BOCU-1. Make sure that both sides of the wire know what's going
    > across.

    Always. And especially in the case of BOCU-1, since it's not
    ASCII-transparent -- although heuristic detection of BOCU-1 should be
    straightforward and very reliable.

    -Doug Ewell
     Fullerton, California
