Re: UTF-8 versus UTF-16 bandwidth

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Wed Aug 18 1999 - 04:02:55 EDT


Frank da Cruz wrote on 1999-08-17 21:56 UTC:
> > Personally, I'd prefer just one format for Unicode exchange: UTF-8. It's
> > not perfect, but no interface is. And on balance it seems better than any
> > alternative.
> >
> I think the main problem with having UTF-8 but not UTF-16 (or vice versa)
> is that one requires greater bandwidth than the other for different classes
> of writing systems.

Fully negligible. Internet bandwidth is consumed by GIF, MPEG and
RealAudio, not by Unicode, hey, not even by ASCII. Lempel-Ziv compression
reduces the 50% overhead of UTF-8 over UTF-16 for CJK down to just a few
percent, and compression is used today wherever bandwidth really
does matter (e.g., V.42bis in phone modems). So the UTF-8 bandwidth
overhead is really more of an academic problem (or worse, a political
pretext). There might be slightly more noticeable performance
differences between UTF-16 and UTF-8 in some very large database
applications and especially large scale full-text information retrieval
engines (depending much on how cleverly they are implemented), but these
are completely outside the scope of MIME anyway. The world would really
be fine if MIME never used UTF-16 and stayed with UTF-8 exclusively. Add
a mechanism to get UTF-8+gzip instead if you truly worry about
bandwidth, because it outperforms uncompressed UTF-16 significantly!
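
A rough way to check the size claim yourself is sketched below in
Python. It is only an illustration under stated assumptions: the helper
name encoded_sizes and the sample string are made up for this example,
and the sample is highly repetitive, so it compresses far better than
real CJK text would. Substitute a real document before drawing any
conclusions about the "few percent" figure.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return byte counts for a string in UTF-8 and UTF-16, raw and gzipped.

    Hypothetical helper for illustration; not part of any standard.
    """
    utf8 = text.encode("utf-8")
    # "utf-16-be" is used so that no byte order mark is added to the count.
    utf16 = text.encode("utf-16-be")
    return {
        "utf8": len(utf8),
        "utf16": len(utf16),
        "utf8_gzip": len(gzip.compress(utf8, compresslevel=9)),
        "utf16_gzip": len(gzip.compress(utf16, compresslevel=9)),
    }


# Assumed sample: a short Japanese sentence repeated many times.
# Every character here lies in the BMP above U+07FF, so each one takes
# 3 bytes in UTF-8 and 2 bytes in UTF-16 (the 50% overhead in question).
sample = "日本語のテキストを圧縮する実験です。" * 200
sizes = encoded_sizes(sample)

for name, size in sizes.items():
    print(f"{name:>10}: {size} bytes")
```

For text like this the uncompressed UTF-8 size is exactly 1.5 times the
UTF-16 size; the interesting comparison is utf8_gzip against plain
utf16.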

(BTW: I hear that ISO, ECMA, and ITU have standardized a couple of
Lempel-Ziv style text compression algorithms. Is one of them compatible
with gzip by any chance, the only one really widely used on the Internet
at the moment?

ISO/IEC 11558:1992 Information technology -- Data compression for
information interchange -- Adaptive coding with embedded dictionary
-- DCLZ Algorithm

ISO/IEC 11576:1994 Information technology -- Procedure for the
registration of algorithms for the lossless compression of data

ISO/IEC 12042:1993 Information technology -- Data compression for
information interchange -- Binary arithmetic coding algorithm

ISO/IEC 15200:1996 Information technology -- Adaptive Lossless Data
Compression algorithm (ALDC)

)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


