RE: That UTF-8 Rant (was Unicode in source)

From: Christophe PIERRET (cpierret@businessobjects.com)
Date: Fri Jul 23 1999 - 09:27:03 EDT


> -----Original Message-----
> From: kenw@sybase.com [mailto:kenw@sybase.com]
> Sent: July 22, 1999 10:38 PM
> To: Unicode List
> Cc: unicode@unicode.org; kenw@sybase.com
> Subject: Re: That UTF-8 Rant (was Unicode in source)
>
>
> Markus,
>
> > UTF-16. I have not yet seen code that is in any way nicer or more
> > elegant on UTF-16 than on UTF-8.
>
> I guess you haven't been looking very hard then.

I strongly agree.

Looking at my code and rethinking its design, I really prefer the
UTF-16 versions over the UTF-8 ones
(even if surrogates are not that simple to handle).
As an added bonus, on Windows NT UTF-16 interoperates really well
with the system ...
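
Decoding a surrogate pair is only a few lines anyway. A minimal
sketch in C (the function name is mine, and it assumes well-formed
input):

    /* Decode one character from a UTF-16 buffer, surrogate pairs
       included; *p is advanced past the code units consumed.     */
    unsigned long utf16_next(const unsigned short **p)
    {
        unsigned long c = *(*p)++;
        if (c >= 0xD800 && c <= 0xDBFF)            /* high surrogate */
            c = 0x10000 + ((c - 0xD800) << 10)
                        + (*(*p)++ - 0xDC00);      /* + low surrogate */
        return c;
    }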

>
> > For table-lookup operations, both UTF-8
> > and UTF-16 have to be converted into a 31-bit integer value via a
> > function such as mbtowc() anyway.
>
> Nonsense. An effective staged trie lookup can be done directly off
> the UTF-16 string value, without conversion to a 32-bit integer.
> It is much less effective to try to do that off a UTF-8 string
> value without conversion to an integer.

I agree; the code is almost always simpler with UTF-16.
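
To illustrate Ken's point, here is roughly what such a two-stage
lookup looks like in C, keyed directly on the 16-bit code unit. The
tables are placeholders; real ones would be generated from the
character database:

    /* stage1 maps the high byte of a code unit to a block number;
       stage2 stores 256-entry property blocks, many of them shared. */
    extern const unsigned short stage1[256];
    extern const unsigned char  stage2[];

    unsigned char char_property(unsigned short cu)
    {
        return stage2[(stage1[cu >> 8] << 8) | (cu & 0xFF)];
    }

Doing the same off UTF-8 means decoding lead and trail bytes first,
which is exactly Ken's point.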

>
> > And the simplicity advantage that UCS-2 has over UTF-8 (ignoring
> > not only Klingon but also the advanced mathematical publishing
> > characters in ISO 10646-2) vanishes very quickly with combining
> > characters, which are not only essential for languages such as
> > Thai but also for mathematical publishing.
>
> You are mixing apples and oranges here. The complexity attendant
> to processing of combining characters is an epiphenomenon that
> sits on top of *any* of the encoding forms for Unicode.
>
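
True, and it is the same work in either encoding form. Just as an
illustration, base e plus combining acute versus the precomposed
character, in bytes:

    unsigned char  u8_pre[]  = { 0xC3, 0xA9 };       /* U+00E9, UTF-8  */
    unsigned char  u8_dec[]  = { 0x65, 0xCC, 0x81 }; /* U+0065 U+0301  */
    unsigned short u16_pre[] = { 0x00E9 };           /* U+00E9, UTF-16 */
    unsigned short u16_dec[] = { 0x0065, 0x0301 };   /* U+0065 U+0301  */

Code that treats those two as equal is needed whatever the encoding.
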
> >
> > I do not believe there is a big field of application for UTF-16.
>
> Again, I guess you didn't look very hard. I would consider Microsoft
> Windows a "big field of application" by any reasonable measure.
> Java is another "big field of application". IBM and Apple both make
> extensive use of UTF-16. I won't start running down the medium-sized
> companies using it.

Again, I agree.

>
> > UTF-16 is in my eyes primarily a political correctness exercise
> > towards the users of scripts who use 6 months of Moore's law by
> > the 3-byte encoding of their characters.
>
> I agree that storage space arguments aren't usually of much value.
> Especially if you are talking about word-processing applications.
> But size does make a difference when one starts talking about
> multi-gigabyte and multi-terabyte database applications.
> People who make
> decisions about database design do care about such things. And data
> transmission times make a difference, too -- although this can be
> addressed with compression in either case.

That's why mixing UTF-8 and UTF-16 storage can be a good idea:
if I work in Japanese I prefer UTF-16, in English UTF-8...
UTF-8 is a somewhat Western-centric encoding, as it favors Latin
scripts (ASCII is beautiful ;-)
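
The byte counts behind that remark, as a quick sketch in C (just
illustrative):

    /* Storage per code point in each encoding form. */
    int utf8_bytes(unsigned long cp)
    {
        if (cp < 0x80)    return 1;  /* ASCII                           */
        if (cp < 0x800)   return 2;  /* Latin suppl., Greek, Cyrillic   */
        if (cp < 0x10000) return 3;  /* rest of the BMP: CJK, Thai, ... */
        return 4;                    /* beyond the BMP                  */
    }

    int utf16_bytes(unsigned long cp)
    {
        return cp < 0x10000 ? 2 : 4; /* BMP, or a surrogate pair */
    }

ASCII pays 1 byte against 2, Greek and Cyrillic break even, and CJK
or Thai pay 3 against 2.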

>
> >
> > Efficiency?
> >
> > Sure, UTF-16 might save you a few CPU cycles over UTF-8 in this
> > conversion to UCS-4 here and there, but again this is just a week
> > or less of Moore's law. People are today very happy to lose at
> > least an order of magnitude more CPU cycles by using interpreted
> > Perl/Python/Java/TCL instead of good old compiled C. Low-cost PCs
> > have been much faster than necessary for word processing for
> > several years now. Even Microsoft runs out of ideas of how to
> > further bloat word processors these days. CPU cycles are burned
> > today with real-time rendered anti-aliased fonts; UTF-8 is much
> > too efficient here.
>
> Once again, talking about efficiency in the context of word processors
> is beside the point. The real issues of efficiency are down in
> the servers straining to serve up those less-than-three-second
> response times on transactions that are distributed across the net to
> tens of thousands of users. People work hard to shave milliseconds
> wherever they can in the servers.

I'm convinced it's not that simple!
I benchmarked most of my algorithms on Pentium II machines, and the
result is:
the efficiency of Unicode text processing algorithms depends on the
encoding AND the language.

In my case:
sorting any Latin-script language was faster with UTF-8 (using the
Unicode Collation Algorithm),
while UTF-16 was faster for Japanese or Russian.
The same applies to regex searching.

A case of "the less memory accessed, the faster".
I never tried it on a 386 or older architecture, with no (or little)
cache and no instruction pipelining;
I guess UTF-16 would be the most efficient there.
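
For anyone who wants to compare, the skeleton of the kind of
micro-benchmark I mean (standard C; the scan functions stand for
your own UTF-8 and UTF-16 routines):

    #include <stdio.h>
    #include <time.h>

    extern void scan_utf8(const unsigned char *s, size_t n);
    extern void scan_utf16(const unsigned short *s, size_t n);

    void bench(const unsigned char *s8, size_t n8,
               const unsigned short *s16, size_t n16, int reps)
    {
        clock_t t0;
        int i;

        t0 = clock();
        for (i = 0; i < reps; i++) scan_utf8(s8, n8);
        printf("UTF-8:  %.2fs\n", (clock() - t0) / (double)CLOCKS_PER_SEC);

        t0 = clock();
        for (i = 0; i < reps; i++) scan_utf16(s16, n16);
        printf("UTF-16: %.2fs\n", (clock() - t0) / (double)CLOCKS_PER_SEC);
    }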

If someone has experienced the same strange behavior, please tell me!

>
> >
> > > In short, my grinding axe says: write code for UTF-16.
> > > Where possible, store UTF-16.
> >
> > Which UTF-16 do you mean? There are at least two mutually
> > incompatible ones around.
>
> See my previous note. You need to distinguish the encoding forms
> from the serializations.
>
> --Ken
>
> >
> > Markus
> >
> > --
> > Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> > Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
> >
> >
>

Chris
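
P.S.: On the encoding form versus serialization point, the whole
difference fits in two lines: the same code unit U+FEFF (the byte
order mark) serialized both ways (illustrative C):

    unsigned char bom_be[] = { 0xFE, 0xFF };  /* UTF-16BE serialization */
    unsigned char bom_le[] = { 0xFF, 0xFE };  /* UTF-16LE serialization */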


