From: Doug Ewell (email@example.com)
Date: Fri Jan 23 2004 - 01:57:15 EST
Markus Scherer <markus dot scherer at jtcsv dot com> wrote:
>> BOCU-1 might solve this problem, but multiplying and dividing by 243
>> doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by
>> the claim in UTN #6 that converting Hindi text between UTF-16 and
>> BOCU-1 took only 45% as long as converting it between UTF-16 and
>> UTF-8.)
> "claim"? That hurts...
> I did measure these things, and the numbers in the table are all from
> my measurements. I also included the type of machine I used, etc.
Certainly I would never accuse Markus of falsifying these statistics.
The word "claim" was not meant in the sense of "unsubstantiated claim."
It did startle me that converting to BOCU-1 and SCSU could be TWICE as
fast as converting to UTF-8, unless the I/O cost of writing two or three
bytes is *much* higher than that of writing only one.
> The reason why BOCU-1 (and SCSU) is often faster than UTF-8 is that
> BOCU-1 goes into single-byte mode for small scripts like Hindi.
> Single-byte mode only performs a subtraction, no div/mod or even bit-
> shifting, and writes/reads only one byte per character. It is also
> optimized in ICU with a tight inner loop.
I'll have to see how my encoder and decoder perform when I finish them.
They're currently written for simplicity, not speed.
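To make the single-byte case concrete, here is a minimal sketch of the step Markus describes: one subtraction against a "previous" base value, and one byte written when the difference is small. It is not ICU's code and not a conformant BOCU-1 encoder. The constants (0x90 as the byte for a zero difference, a single-byte reach of -64..63, an initial state of 0x40) are given as I understand them from UTN #6 and should be treated as assumptions; the multi-byte path and the special handling of space, control characters, and the CJK/Hangul/Hiragana ranges are left out entirely. The function names are hypothetical.

```python
MIDDLE = 0x90       # assumed: byte value that encodes a difference of zero
REACH_NEG = -64     # assumed: smallest difference that fits in one byte
REACH_POS = 63      # assumed: largest difference that fits in one byte


def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for one code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def single_byte_diffs(text: str, prev: int = 0x40) -> list:
    """Return one entry per character: the single byte that encodes it,
    or None where the difference is too large (multi-byte path omitted)."""
    out = []
    for ch in text:
        cp = ord(ch)
        diff = cp - prev                    # the only arithmetic on the fast path
        if REACH_NEG <= diff <= REACH_POS:
            out.append(MIDDLE + diff)       # one byte, no div/mod, no shifting
        else:
            out.append(None)                # would need the multi-byte encoder
        prev = (cp & ~0x7F) + 0x40          # simplified: middle of the 128-block
    return out


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"   # the word "Hindi" in Devanagari
encoded = single_byte_diffs(hindi)
print(sum(b is not None for b in encoded), "of", len(hindi), "in one byte")
print(sum(utf8_len(ord(c)) for c in hindi), "bytes in UTF-8")
```

Under these assumptions only the first character needs the longer path; every later Devanagari letter falls within reach of the block's midpoint, while UTF-8 spends three bytes on each.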
> UTF-8 is useful because it's simple, and supported just about
> everywhere - but it's otherwise hardly optimal for anything.
As John said, it's all about ASCII transparency, together with no false
positives for "ASCII bytes" in non-Basic Latin characters.
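The "no false positives" property is easy to check: in UTF-8, every byte of a character outside Basic Latin is 0x80 or above, so a byte-level search for an ASCII character such as "/" can only hit a real "/". A small sketch (the helper name and the sample string are made up for illustration):

```python
def ascii_lookalike_bytes(text: str) -> list:
    """Return any UTF-8 bytes below 0x80 that come from non-ASCII characters."""
    hits = []
    for ch in text:
        if ord(ch) >= 0x80:
            hits.extend(b for b in ch.encode("utf-8") if b < 0x80)
    return hits


sample = "path/\u0939\u093f\u0928\u094d\u0926\u0940/\u00e9.txt"
print(ascii_lookalike_bytes(sample))          # []
print(sample.encode("utf-8").count(b"/"))     # 2, the two real slashes
```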
> If you want high-speed, compact encoding, use SCSU. If you want good
> speed, compact encoding, and binary order and/or MIME compatibility,
> use BOCU-1. Make sure that both sides of the wire know what's going
> on.
Always. And especially in the case of BOCU-1, since it's not
ASCII-transparent -- although heuristic detection of BOCU-1 should be
straightforward and very reliable.