Re: Endless endianness annoyance

From: Adrian Havill (havill@threeweb.ad.jp)
Date: Wed Dec 03 1997 - 22:52:43 EST


> Not exactly; I'm saying UTF-8 is faster because it (almost always in the
> *aggregate*) means fewer bits transferred, and that is the bottleneck; it's
> easier because you don't have to even think about byte order.

That's a very Greek/Cyrillic/Armenian/Hebrew/Arabic/Latin-centric way of
thinking. (^_^) Those of us using the U+0900-and-above range have to put up with
the three-byte UTF-8 "consumption tax."
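To make the "tax" concrete, here is a minimal sketch (illustrative only, not
from any particular implementation) of what a BMP code point costs in UTF-8
versus the flat two bytes of UCS-2; note the break actually falls at U+0800:

    /* Illustrative only: bytes needed to encode a BMP code point
     * (U+0000..U+FFFF) in UTF-8.  UCS-2 always uses 2 bytes here. */
    static int utf8_len(unsigned int cp)
    {
        if (cp < 0x0080) return 1;   /* ASCII: UTF-8 wins, 1 byte vs. 2      */
        if (cp < 0x0800) return 2;   /* Greek, Cyrillic, Hebrew, etc.: a tie */
        return 3;                    /* U+0800 and up (kana, Kanji): 3 vs. 2 */
    }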

Here is some example real-world text (it contains a lot of embedded one-byte
English ASCII, and many CRLF sequences as well, which UTF-8 doesn't expand
[compared to ASCII] but UCS-2 does):

======

The EFF Netguide translated into Japanese:

[Available from
<URL:http://www.eff.org/pub/Net_info/EFF_Net_Guide/Other_versions/Japanese/netgd_jp.sjis.gz>]

Shift-JIS original: 375884 (367K)

converted to UCS-2: 473476 (462K) +26%
converted to UTF-7: 481143 (470K) +28%
converted to UTF-8: 514763 (503K) +37%

[Converted files available from:
<URL:http://www.threeweb.ad.jp/~havill/netguide.----.txt> where the "----" is
ucs2, utf7, or utf8]

Despite the large amount of embedded English, TTY screen dumps, and carriage
returns (which UTF-8 actually compresses compared to UCS-2), UCS-2 is the big
winner for large Japanese texts; a rough sketch of why the ASCII/Kanji mix
matters follows below. If others can provide conversion efficiency numbers
(UTF-8, UTF-7, UCS-2) for other large, real-world Japanese texts, I'd be
interested in seeing them. We've already done the Teachings of Buddha and the
Bible; as these were pure Kanji, the blowup with UTF-8 was over 40%. We're
interested in large modern Japanese texts, which often contain a liberal amount
of Latin characters sprinkled in them.

======
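For what it's worth, a back-of-the-envelope sketch (my own, with made-up
character counts chosen only for illustration, and assuming every non-ASCII
character is a BMP character at or above U+0800) of why a mixed text lands
between the pure-ASCII and pure-Kanji extremes:

    /* Illustrative only: estimate encoded sizes from character counts. */
    #include <stdio.h>

    int main(void)
    {
        long ascii = 100000, kanji = 140000;   /* hypothetical counts      */
        long ucs2 = 2 * (ascii + kanji);       /* flat 2 bytes each        */
        long utf8 = 1 * ascii + 3 * kanji;     /* 1 byte vs. 3 bytes each  */

        printf("UCS-2: %ld bytes, UTF-8: %ld bytes (%+.0f%%)\n",
               ucs2, utf8, 100.0 * (utf8 - ucs2) / ucs2);
        return 0;
    }

With no ASCII at all, UTF-8 comes out 50% larger than UCS-2; the more ASCII is
mixed in, the smaller the penalty, which is why the netguide fares better than
the pure-Kanji texts.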

Another example (for which I regrettably can't release numbers) was when we
converted our Oracle databases from EUC to AL24UTFFSS; the blowup there was
extreme. Since the data was pure first and last names and addresses in Kanji
(the only things in the U+0000 to U+0900 range were the postal code numbers and
the district, block, and house numbers), the blowup was over 40%, and we had to
adjust the VARCHAR lengths to accommodate the new UTF format. For a large
database, the penalty of using UTF-8 was immense. Had we had the option of
using UCS-2 (technically we could, by storing it as raw binary, but we'd lose
the conversion and string routines built into the system), the difference would
have been minimal.

UTF-8 has many good points. In particular, it's very resistant to the Japanese
user nightmare of Moji-bake, where one corrupt byte can garble all data up to
the next non-JIS (ASCII) character in the case of ISO-2022-JP and EUC-JP.
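The reason is that UTF-8 is self-synchronizing: lead bytes and continuation
bytes occupy disjoint ranges, so a decoder can always find the next character
boundary. A minimal sketch (mine, illustrative only) of the resynchronization
step:

    /* Illustrative only: after a bad byte, skip forward to the next byte
     * that can legally start a UTF-8 sequence.  Continuation bytes are
     * always 10xxxxxx (0x80..0xBF), so at most one character is lost --
     * unlike ISO-2022-JP, where the shift state itself is lost. */
    const unsigned char *resync(const unsigned char *p, const unsigned char *end)
    {
        while (p < end && (*p & 0xC0) == 0x80)
            p++;                      /* skip continuation bytes          */
        return p;                     /* first possible lead byte, or end */
    }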

But I think it's a bit of an overstatement to tout UTF-8 as the cure-all,
end-all for Unicode transmission. In environments where the transmission is
8-bit clean and error-free and the byte order is known (or a byte-order mark is
guaranteed to be present at the head of a UCS-2 stream), I see little benefit
in using it over UCS-2.

(A wish list to database manufacturers: please support UCS-2 internal coding,
not just UTF-8... the performance hit that CJK users take from UTF-8 is too
large when you're talking about very large sets of pure name and address data)

Easier? Maybe more robust to errors, but not easier. True, a few shifts, binary
ANDs and ORs, and a compare or two are all you need (done very fast on almost
any CPU) to convert UTF-8 to 16-bit form for internal use. Very easy. No
byte-order hassle.

But reading in UCS-2 [and converting it to the proper endian form] is both
easier and faster. Even if the endian is wrong, many CPUs can swap the bytes of
a 16- or 32-bit value with a built-in instruction (a rotate or exchange on any
x86, and BSWAP for 32-bit values on the 486 and later), which is faster than
the seven or eight shift/bit-mask/compare instructions needed to get UTF-8 into
16-bit form.
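By way of comparison, the whole per-character fix-up for wrong-endian UCS-2 is
a single 16-bit byte swap; a portable sketch (illustrative, not from any
particular library):

    /* Illustrative only: swap the two bytes of a UCS-2 code unit.  A
     * compiler will typically reduce this to one rotate/exchange
     * instruction on x86 and most other CPUs. */
    unsigned short swap16(unsigned short v)
    {
        return (unsigned short)((v << 8) | (v >> 8));
    }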
