Re: Unicode forms for internal storage - BOCU-1 speed

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Jan 22 2004 - 12:42:55 EST

Next message: Markus Scherer: "Re: problem - non-ASCII characters on Windows command line"

Previous message: Andrew C. West: "Re: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)"
In reply to: Doug Ewell: "Re: Unicode forms for internal storage"
Next in thread: jcowan@reutershealth.com: "Re: Unicode forms for internal storage - BOCU-1 speed"
Reply: jcowan@reutershealth.com: "Re: Unicode forms for internal storage - BOCU-1 speed"
Maybe reply: Kenneth Whistler: "Re: Unicode forms for internal storage - BOCU-1 speed"
Reply: Doug Ewell: "Re: Unicode forms for internal storage - BOCU-1 speed"

Doug Ewell wrote:
> BOCU-1 might solve this problem, but multiplying and dividing by 243
> doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by the
> claim in UTN #6 that converting Hindi text between UTF-16 and BOCU-1
> took only 45% as long as converting it between UTF-16 and UTF-8.)

"claim"? That hurts...

I did measure these things, and the numbers in the table are all from my measurements. I also
included the type of machine I used, etc. (http://www.unicode.org/notes/tn6/#Performance)

The reason why BOCU-1 (like SCSU) is often faster than UTF-8 is that BOCU-1 effectively goes into
single-byte mode for small scripts like Hindi. Single-byte mode only performs a subtraction, with no
div/mod or even bit-shifting, and writes/reads only one byte per character. The ICU implementation
is also optimized with a tight inner loop.
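
To illustrate the single-byte fast path described above, here is a minimal sketch in Python. It is not ICU's code and not the real BOCU-1 byte assignment: the function names are hypothetical, and the state rule (middle of the current 128-code-point block) and the -64..63 single-byte range are simplifying assumptions. It only counts how many characters could be handled with one subtraction and one byte.

```python
from typing import Tuple

def block_middle(cp: int) -> int:
    # Assumed state rule: middle of the 128-code-point block containing cp.
    return (cp & ~0x7F) + 0x40

def count_single_byte_steps(text: str, initial_state: int = 0x40) -> Tuple[int, int]:
    """Return (characters that fit the one-byte path, total characters)."""
    state = initial_state
    single = 0
    for ch in text:
        delta = ord(ch) - state      # the only arithmetic on the fast path
        if -64 <= delta <= 63:       # assumed single-byte range
            single += 1
        state = block_middle(ord(ch))
    return single, len(text)
```

Under these assumptions, only the first character of a run of Devanagari text needs a longer sequence to move the state into the script's block; every following character in that block takes the one-byte path.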

UTF-8 on the other hand encodes Hindi with 3 bytes per character and has to perform the bit-shifting
and write to/read from more memory locations.
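
For comparison, the bit-shifting that UTF-8 needs for a three-byte character can be sketched as follows. The function name is hypothetical, and this handles only the three-byte range (U+0800..U+FFFF, excluding surrogates), which is where Devanagari falls.

```python
def encode_utf8_3byte(cp: int) -> bytes:
    """Encode one code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point is outside the three-byte UTF-8 range")
    return bytes([
        0xE0 | (cp >> 12),           # lead byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # trail byte: middle 6 bits
        0x80 | (cp & 0x3F),          # trail byte: low 6 bits
    ])
```

Each character costs two shifts, two masks, three ORs, and three byte writes, compared with one subtraction and one byte write on the single-byte path.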

It's the same for Greek/Russian/Arabic etc., although to a lesser degree, because there it's single
bytes with BOCU-1 vs. only 2 bytes per character with UTF-8.
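
The UTF-8 side of this comparison is easy to check with Python's built-in codec. The helper name and the sample words are illustrative choices, not taken from the measurements in UTN #6.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per code point in text."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "Greek": "\u03b1\u03b2\u03b3",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",
    "Arabic": "\u0633\u0644\u0627\u0645",
}

for name, word in samples.items():
    print(name, utf8_bytes_per_char(word))
```

The Hindi sample averages 3 bytes per character and the other three average 2, which is why the gap to a one-byte-per-character encoding is smaller for them.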

The fact that BOCU-1 not only achieves good compression (and binary order and MIME text/
compatibility) but also reasonable conversion performance encouraged Mark and me to publish it.

UTF-8 is useful because it's simple, and supported just about everywhere - but it's otherwise hardly
optimal for anything.

If you want high-speed, compact encoding, use SCSU. If you want good speed, compact encoding, and
binary order and/or MIME compatibility, use BOCU-1. Make sure that both sides of the wire know
what's going across.

markus
