From: Doug Ewell (email@example.com)
Date: Fri May 20 2005 - 08:52:12 CDT
Peter Kirk <peterkirk at qaya dot org> wrote:
> Actually, much of the Internet infrastructure can still deal only with
> 7-bit characters, as we have been discussing on another thread. In
> order to carry 8-bit data, whether legacy encoded or UTF-8, across the
> Internet, it is apparently necessary to insert a low level "Quoted
> Printable" encoding layer to recode any bytes with the top bit set as
> three characters, leading to gross inefficiency in transmission of
> anything other than ASCII text - any UTF-8 encoded Unicode character
> beyond U+0080 is transmitted as between six and twelve bytes in this
> encoding. If we can tolerate this kind of extra layer to carry 8-bit
> character based data on a 7-bit medium, surely we can tolerate a
> similar layer to carry 32-bit character data on a 7-bit or 8-bit
> medium, for a transitional period until the Internet or its successor
> is upgraded to support 32-bit data at its lowest levels. And it should
> be possible to devise a suitably efficient encoding which is a lot
> less inefficient than UTF-8 over "Quoted Printable". Well, of course
> UTF-7 and UTF-8 are suitable encodings, but I am understanding them
> here as being used as content transfer encodings rather than as
> character sets.
UTF-7 was indeed created for exactly this purpose, to represent
non-Basic-Latin text more compactly than UTF-8 plus quoted-printable.
People hate it because it is hard to read in "bare" form (the encoded
form is built from the same ASCII letters and digits as "normal" text)
and hard to process (the 16-bit code units don't fall along the
boundaries of the 6-bit Base64 characters that carry them).
But it's unlikely that any other encoding scheme could have done any
better, given the same constraints. Believe me, those of us who like
playing with bits have tried.
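
For anyone who wants to check the sizes, here is a rough Python sketch.
It is only an illustration under stated assumptions: the function names
are made up for this example, the quoted-printable figure comes from a
simplified model (three characters per escaped byte, no soft line
breaks, no control-character or trailing-whitespace rules), and the
UTF-7 figure relies on Python's built-in "utf-7" codec, which is one
valid encoder among several possible ones.

```python
def qp_over_utf8_size(text: str) -> int:
    """Size in bytes of text as UTF-8 wrapped in quoted-printable.

    Simplified model (an assumption of this sketch): every byte with
    the top bit set, and the literal '=' sign, becomes a three-character
    "=XX" escape. Soft line breaks, control characters, and the
    trailing-whitespace rules are ignored.
    """
    data = text.encode("utf-8")
    return sum(3 if (b >= 0x80 or b == 0x3D) else 1 for b in data)


def utf7_size(text: str) -> int:
    """Size in bytes of text in UTF-7, using Python's built-in codec."""
    return len(text.encode("utf-7"))


if __name__ == "__main__":
    samples = (
        "plain ASCII",
        "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440",  # Cyrillic
        "\u65e5\u672c\u8a9e",                                        # CJK
    )
    for sample in samples:
        # !a prints an ASCII-safe repr so the output survives any console
        print(f"{sample!a}: QP over UTF-8 = {qp_over_utf8_size(sample)}, "
              f"UTF-7 = {utf7_size(sample)}")
```

Under this model a two-byte UTF-8 character costs six bytes and a
four-byte one costs twelve, which matches the figures quoted above.
UTF-7 pays a shift-in/shift-out overhead on each non-ASCII run, so the
saving shows up mainly on longer runs of non-ASCII text.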
-- Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/