Re: That UTF-8 Rant

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 22 1999 - 20:09:35 EDT


Markus Kuhn responded:

> >
> > I agree that storage space arguments aren't usually of much value.
> > Especially if you are talking about word-processing applications.
> > But size does make a difference when one starts talking about
> > multi-gigabyte and multi-terabyte database applications. People who make
> > decisions about database design do care about such things. And data
> > transmission times make a difference, too -- although this can be
> > addressed with compression in either case.
>
> Actually, I happen to be extremely interested in exactly these
> questions, because I happen to be someone who makes implementation
> decisions about databases that could one day grow into the
> hundreds-of-gigabyte range. I have not yet seen multi-terabyte plain
> text databases though (perhaps the email/fax eavesdroppers at the NSA
> have these, if anyone ;-), these tend more to be filled with images and
> not text.

Well, such things have existed for some time now, probably led by
the consumer packaged goods industry, sitting on the other end of all
those retail sale scanners, but also the marketing warehouses assembled
by the credit card companies, and databases used by the big insurance
firms. There is a reason why Teradata Corporation stuck the "Tera" in
its name.
Yeah, they aren't just text data, but they are relational databases
consisting mostly of text and numeric data fields -- hundreds of millions
of records, and not stuffed with image data like the satellite archives.

[Many good technical observations omitted.]

>
> I believe that certainly in the western world, but most likely also on a
> global average, UTF-8 therefore gives close to a 50% performance
> improvement over UTF-16 in database lookups.

I concur with your conclusion, for all the reasons you have cited.
If you have 4 terabytes of ASCII data, and need to mix in a little
bit of other data for your overseas operations, UTF-8 is an obvious
choice -- both for migration and for performance.

But the performance arguments cut both ways. If you are a Chinese
company trying to set up a data warehouse operation, and you need
to include, say, Japanese and/or Korean data not expressible in
a legacy Chinese character set, you could expect better performance
if your database supported UTF-16 as a storage and processing form
than if it only supported UTF-8.
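
To make the size side of that tradeoff concrete, here is a minimal
Python sketch that counts encoded bytes for a few strings. The sample
strings and the helper name encoded_sizes are illustrative assumptions,
not data or code from this thread, and byte counts are only a rough
proxy for database performance.

```python
# Sample strings are illustrative assumptions, not data from this thread.
samples = {
    "ascii": "customer record",
    "chinese": "数据仓库",
    "japanese": "データベース",
    "korean": "데이터베이스",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name:9s} UTF-8: {utf8:3d} bytes   UTF-16: {utf16:3d} bytes")
```

ASCII characters take one byte in UTF-8 and two in UTF-16, while most
Chinese, Japanese, and Korean characters take three bytes in UTF-8 and
two in UTF-16, so which form is smaller depends on the mix of data.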

>
> I am well aware that my experimental evidence here is not yet very
> complete, but at least I wouldn't dismiss the use of UTF-8 in
> high-performance database applications immediately based on performance
> reasons. It might well be the more efficient solution in real-life
> applications.

Nor would I. I completely agree with your conclusion. There are a couple
of reasons why both Oracle and Sybase have had UTF-8 support in their
databases for some time now, and are only now getting around to
UTF-16 support. Customer needs and performance for the kind of
data sets they have are among those reasons.

>
> > I would consider Microsoft
> > Windows a "big field of application" by any reasonable measure.
> > Java is another "big field of application". IBM and Apple both make
> > extensive use of UTF-16. I won't start running down the medium-sized
> > companies using it.
>
> I am not in favour of quoting large companies who have made certain
> technical decisions as an argument for the quality of a specific
> technical solution. We all have read enough Dilbert to understand how
> technical decisions are usually made today in our hype-driven commercial
> world.

But I think you might consider consulting the technical people
who made many of the Unicode implementation decisions in their
companies to see how many of them feel they were made by Catbert
throwing darts at the dartboard.

> At least I am more skeptically amused than impressed by "it must
> be the right way because Microsoft Windows, IBM, Apple, etc. also do it"
> (especially if the word Java appears in the same paragraph).
>
> Millions of flies can't be wrong: manure tastes lovely.

Well, to continue your metaphor unnecessarily, flies also cluster
around dead horses, but I'm not sure I'd want to pick my horse
on that basis.

--Ken

>
> Markus
>
> --
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
>
>


