Re: That UTF-8 Rant (was Unicode in source)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 22 1999 - 16:40:52 EDT


Markus,

> UTF-16. I have not yet seen code that is in any way nicer or more
> elegant on UTF-16 than on UTF-8.

I guess you haven't been looking very hard then.

> For table-lookup operations, both UTF-8
> and UTF-16 have to be converted into a 31-bit integer value via a
> function such as mbtowc() anyway.

Nonsense. An effective staged trie lookup can be done directly off
the UTF-16 string value, without conversion to a 32-bit integer.
It is much less effective to try to do that off a UTF-8 string
value without conversion to an integer.
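
To give a concrete flavor of what I mean, here is a minimal sketch of a
two-stage, trie-style lookup keyed directly off 16-bit code units, with
no widening to a 32-bit code point. The property tested ("is this an
ASCII digit?") and the hand-built tables are purely illustrative; real
tables are generated from the Unicode Character Database.

    #include <stdio.h>
    #include <stdint.h>

    static uint8_t stage2[2][256]; /* block 0: all-zero default; block 1: the 0x00xx page */
    static uint8_t stage1[256];    /* high byte of the code unit -> block index */

    static void build_tables(void)
    {
        stage1[0x00] = 1;                  /* only the 0x00xx page needs a non-default block */
        for (int c = '0'; c <= '9'; c++)
            stage2[1][c] = 1;
    }

    /* The lookup itself: two indexed loads, straight off the code unit. */
    static uint8_t lookup(uint16_t code_unit)
    {
        return stage2[stage1[code_unit >> 8]][code_unit & 0xFF];
    }

    int main(void)
    {
        build_tables();
        printf("%d %d %d\n", lookup(0x0037), lookup(0x0041), lookup(0x4E2D)); /* 1 0 0 */
        return 0;
    }

The same structure works for a UTF-8 string only after the lead byte and
trail bytes have been reassembled into a code point, which is exactly
the conversion step at issue.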

> And the simplicity advantage that
> UCS-2 has over UTF-8 (ignoring not only Klingon but also the advanced
> mathematical publishing characters in ISO 10646-2) vanishes very quickly
> with combining characters, which are not only essential for languages
> such as Thai but also for mathematical publishing.

You are mixing apples and oranges here. The complexity attendant
to processing of combining characters is an epiphenomenon that
sits on top of *any* of the encoding forms for Unicode.
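
To make that concrete: the logical sequence <U+0065, U+0301> -- an 'e'
followed by a combining acute accent -- is the same two abstract
characters in either encoding form, so whatever combining-character
processing you do sits on top of either one equally. A toy dump (the
byte values follow directly from the encoding definitions):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* UTF-8: one byte for U+0065, two bytes for U+0301. */
        const unsigned char utf8[]  = { 0x65, 0xCC, 0x81 };
        /* UTF-16: one 16-bit code unit per character. */
        const uint16_t      utf16[] = { 0x0065, 0x0301 };

        printf("UTF-8 bytes:      ");
        for (size_t i = 0; i < sizeof utf8; i++)
            printf(" %02X", (unsigned)utf8[i]);
        printf("\nUTF-16 code units:");
        for (size_t i = 0; i < sizeof utf16 / sizeof utf16[0]; i++)
            printf(" %04X", (unsigned)utf16[i]);
        printf("\n");
        return 0;
    }

Only the storage layout differs; the hard part -- deciding how the two
characters interact for rendering, searching, and so on -- is identical
work in either case.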

>
> I do not believe there is a big field of application for UTF-16.

Again, I guess you didn't look very hard. I would consider Microsoft
Windows a "big field of application" by any reasonable measure.
Java is another "big field of application". IBM and Apple both make
extensive use of UTF-16. I won't start running down the medium-sized
companies using it.

> UTF-16
> is in my eyes primarily a political correctness exercise towards the
> users of scripts who use up 6 months of Moore's law through the 3-byte
> encoding of their characters.

I agree that storage space arguments aren't usually of much value,
especially if you are talking about word-processing applications.
But size does make a difference when one starts talking about
multi-gigabyte and multi-terabyte database applications. People who make
decisions about database design do care about such things. And data
transmission times make a difference, too -- although this can be
addressed with compression in either case.
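
For concreteness, a back-of-the-envelope sketch of the per-code-point
sizes at issue (the figures follow directly from the encoding
definitions; the 4-byte cap on UTF-8 assumes we stop at U+10FFFF and
ignore the longer forms of the original 31-bit definition):

    #include <stdio.h>

    static int utf8_len(unsigned long cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    static int utf16_len(unsigned long cp)
    {
        return cp < 0x10000 ? 2 : 4;   /* beyond the BMP: one surrogate pair */
    }

    int main(void)
    {
        /* ASCII, Thai, CJK, and a Plane 1 mathematical alphanumeric character */
        unsigned long samples[] = { 0x0041, 0x0E01, 0x4E2D, 0x1D49C };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
            printf("U+%04lX  UTF-8: %d bytes  UTF-16: %d bytes\n",
                   samples[i], utf8_len(samples[i]), utf16_len(samples[i]));
        return 0;
    }

For text dominated by Thai or CJK, the 3-versus-2 difference is a
straight 50% increase in storage, which is exactly the kind of number a
database designer notices.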

>
> Efficiency?
>
> Sure, UTF-16 might save you a few CPU cycles over UTF-8 in this
> conversion to UCS-4 here and there, but again this is just a week or
> less of Moore's law. People are today very happy to lose at least an
> order of magnitude more CPU cycles by using interpreted Perl/Python/
> Java/TCL instead of good old compiled C. Low-cost PCs have been much
> faster than necessary for word processing for several years now. Even
> Microsoft is running out of ideas for how to further bloat word processors
> these days. CPU cycles are burned today with real-time rendered
> anti-aliased fonts; UTF-8 is much too efficient here.

Once again, talking about efficiency in the context of word processors
is beside the point. The real issues of efficiency are down in
the servers straining to serve up those less-than-three-second
response times on transactions that are distributed across the net to
tens of thousands of users. People work hard to shave milliseconds
wherever they can in the servers.

>
> > In short, my grinding axe says: write code for UTF-16. Where possible, store
> > UTF-16.
>
> Which UTF-16 do you mean? There are at least two mutually incompatible
> ones around.

See my previous note. You need to distinguish the encoding forms
from the serializations.
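
Concretely -- and this is just a toy dump of the byte orders, not the
full rules for the registered charset labels (those also cover the byte
order mark, omitted here) -- the encoding form is the sequence of
16-bit code units, and the serializations differ only in how each unit
is laid out as two bytes:

    #include <stdio.h>
    #include <stdint.h>

    static void dump(const char *label, const uint16_t *units, size_t n,
                     int big_endian)
    {
        printf("%-15s", label);
        for (size_t i = 0; i < n; i++) {
            unsigned hi = units[i] >> 8, lo = units[i] & 0xFF;
            if (big_endian)
                printf(" %02X %02X", hi, lo);
            else
                printf(" %02X %02X", lo, hi);
        }
        printf("\n");
    }

    int main(void)
    {
        const uint16_t text[] = { 0x0041, 0x4E2D };   /* 'A' and a CJK ideograph */
        dump("big-endian:", text, 2, 1);
        dump("little-endian:", text, 2, 0);
        return 0;
    }

Code that works on the code units never sees the difference; only the
layer that reads and writes byte streams has to care which serialization
it is handed.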

--Ken

>
> Markus
>
> --
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
>
>


