Re: That UTF-8 Rant

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Thu Jul 22 1999 - 19:28:49 EDT


Kenneth Whistler wrote on 1999-07-22 20:37 UTC:
> > UTF-16
> > is in my eyes primarily a political correctness exercise towards the
> > users of scripts who use 6 months of Moore's law by the 3-byte encoding
> > of their characters.
>
> I agree that storage space arguments aren't usually of much value.
> Especially if you are talking about word-processing applications.
> But size does make a difference when one starts talking about
> multi-gigabyte and multi-terabyte database applications. People who make
> decisions about database design do care about such things. And data
> transmission times make a difference, too -- although this can be
> addressed with compression in either case.

Actually, I happen to be extremely interested in exactly these
questions, because I happen to be someone who makes implementation
decisions about databases that could one day grow into the
hundreds-of-gigabyte range. I have not yet seen multi-terabyte plain
text databases, though (perhaps the email/fax eavesdroppers at the NSA
have these, if anyone ;-); databases of that size tend to be filled with
images rather than text.

Therefore a few of my observations in this field:

 - While RAM and disk space aren't a big issue any more these days,
   bus and network bandwidth and the speed of search algorithms still
   are, and will continue to be for some time.

 - Network bandwidth can be taken care of by LZ-style compression
   algorithms, but CPU bus bandwidth can't.

 - Bus bandwidth can be a limiting factor in index traversal and full-text
   substring searches.

 - The vast majority (>> 80%) of characters handled today in networked
   databases are 7-bit ASCII. I have yet to see a single >10 gigabyte
   database consisting predominantly of non-Latin text (outside the
   basement of US intelligence agencies :). This is not an issue of
   the deployment of Unicode, because suitable national non-Latin
   character sets have been around for over 15 years.

 - Given the global mix ratio of ASCII versus non-ASCII characters used
   on the Internet today, I believe that UTF-8 text is on average almost
   half the size of the same text in UTF-16.

 - Many important algorithms such as full-text string search and B-tree
   prefix lookups can be implemented equally easily in both UTF-8 and
   UTF-16; however, their execution time is proportional to the number
   of bits required by the encoding and transferred through the bus
   bottleneck.

 - UTF-8 is a very simple compression algorithm that, thanks to its
   stateless encoding, is compatible with most string search and indexing
   algorithms, while better compression algorithms such as gzip
   are certainly not.
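
As a rough illustration of the size and search points in the list above,
here is a small Python sketch. The sample strings and the helper names
(encoded_sizes, find_utf8) are assumptions made up for this example, not
measurements from any real database. It compares the encoded sizes of a
few strings and shows that a plain byte-wise substring search works
directly on UTF-8 data.

```python
# Illustrative only: these sample strings are hypothetical, not real data.
samples = {
    "ascii": "SELECT name FROM customers WHERE id = 42",
    "german": "Grüße aus München",
    "greek": "Καλημέρα κόσμε",
    "japanese": "こんにちは世界",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Byte-wise substring search on UTF-8 data; returns a byte offset or -1.

    In valid UTF-8, lead bytes and continuation bytes use disjoint value
    ranges, so a match of a complete, valid needle inside a valid haystack
    always starts and ends on a character boundary. No decoding is needed.
    """
    return haystack.find(needle)


if __name__ == "__main__":
    for name, text in samples.items():
        u8, u16 = encoded_sizes(text)
        print(f"{name:9s} UTF-8: {u8:3d} bytes   UTF-16: {u16:3d} bytes")

    hay = samples["german"].encode("utf-8")
    print("byte offset of 'München':", find_utf8(hay, "München".encode("utf-8")))
```

For the Japanese sample UTF-8 comes out larger (21 bytes against 14),
which is the 3-byte case from the quoted text, so the size advantage
depends entirely on how ASCII-heavy the data actually is.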

I believe that certainly in the western world, but most likely also on a
global average, UTF-8 therefore gives close to a 50% performance
improvement over UTF-16 in database lookups.
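
To make the arithmetic behind such an estimate explicit, here is a
back-of-envelope sketch in Python. It assumes that all characters are in
the BMP (2 bytes each in UTF-16), that ASCII characters take 1 byte in
UTF-8, and that every other character takes an assumed average of
non_ascii_bytes bytes. The function name and the ASCII fractions fed into
it are illustrative inputs, not measured data.

```python
def utf8_to_utf16_ratio(ascii_fraction, non_ascii_bytes=3.0):
    """Estimate (UTF-8 size) / (UTF-16 size) for BMP-only text.

    ascii_fraction  -- share of characters that are 7-bit ASCII (0..1)
    non_ascii_bytes -- assumed average UTF-8 length of the other characters
                       (3.0 is the worst case within the BMP)
    """
    if not 0.0 <= ascii_fraction <= 1.0:
        raise ValueError("ascii_fraction must be between 0 and 1")
    utf8_avg = ascii_fraction * 1.0 + (1.0 - ascii_fraction) * non_ascii_bytes
    utf16_avg = 2.0  # every BMP character is one 16-bit code unit
    return utf8_avg / utf16_avg


if __name__ == "__main__":
    for p in (1.0, 0.9, 0.8, 0.5):
        ratio = utf8_to_utf16_ratio(p)
        print(f"ASCII fraction {p:.0%}: UTF-8/UTF-16 size ratio = {ratio:.2f}")
```

Under these assumptions the ratio approaches one half only when the text
is almost entirely ASCII, and the two encodings break even at 50% ASCII
if all remaining characters need 3 bytes in UTF-8.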

I am well aware that my experimental evidence here is not yet very
complete, but at least I wouldn't immediately dismiss the use of UTF-8
in high-performance database applications for performance reasons. It
might well be the more efficient solution in real-life applications.

> I would consider Microsoft
> Windows a "big field of application" by any reasonable measure.
> Java is another "big field of application". IBM and Apple both make
> extensive use of UTF-16. I won't start running down the medium-sized
> companies using it.

I am not in favour of quoting large companies who have made certain
technical decisions as an argument for the quality of a specific
technical solution. We all have read enough Dilbert to understand how
technical decisions are usually made today in our hype-driven commercial
world. At least I am more skeptically amused than impressed by "it must
be the right way because Microsoft Windows, IBM, Apple, etc. also do it"
(especially if the word Java appears in the same paragraph).

Millions of flies can't be wrong: manure tastes lovely.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


