John Cowan asked:
> Kenneth Whistler wrote:
> > "A transfer encoding syntax is a reversible transform of encoded data which
> > may (or may not) include textual data represented in one or more character
> > encoding schemes."
> > TES's are things like base64, uuencode, BinHex, quoted-printable, etc., that
> > are designed to convert textual (or other) data into sequences of byte
> > values that avoid particular values that would confuse one or more Internet or
> > other transmission/storage protocols.
> But UTF-8 was originally created in order to avoid the octets 00 and 2F in the
> representation of any characters other than U+0000 and U+002F, because the
> Unix and Plan 9 filesystems were sensitive to those octets.
> What's the difference?
TES's fundamentally are *not* character data. They are "encrypted" forms of
data, which may include character data, and may also include other data.
(Yes, I realize they are not technically "encryptions", but "codings" -- but
I am trying to avoid the polysemous slipperiness of the term "coding", which
everyone on this list automatically associates in their heads with
character encoding.)
The point is that nobody treats, stores, transmits, or edits a TES as if
it were a representation of plain text per se. Instead, it is a specially
engineered sequence of bytes that is designed to convey information while
avoiding byte values that various protocols hold sacred for other reasons.
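To make this concrete, here is a minimal sketch (in Python) of base64, the canonical TES: it is a reversible transform over arbitrary bytes, and its output deliberately stays inside a small protocol-safe alphabet.

```python
import base64

# Arbitrary binary data, including byte values that many protocols
# hold sacred: NUL, CR, LF, and a high byte.
raw = bytes([0x00, 0x0D, 0x0A, 0xFF, 0x41])

# base64 maps each 3 input bytes to 4 output bytes drawn from a
# 65-character alphabet (A-Z, a-z, 0-9, '+', '/', plus '=' padding),
# so none of the "sacred" byte values can appear in the output.
encoded = base64.b64encode(raw)
safe = set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=")
assert all(b in safe for b in encoded)

# And the transform is fully reversible, as the definition requires.
assert base64.b64decode(encoded) == raw
```

Note that the encoded form is useless as an in-memory representation of the data -- it exists only to survive transit.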
Nobody except the strangely benighted would write a uuencode editor that used
uuencode as its in-memory backing store. It is not designed to be a character
encoding form. On the other hand, character encoding forms are *precisely*
what we see as in-memory representations of text. They are the computer-enabled
version of the sets of integers defined in the character encoding standards.
The interesting case of UTF-8 is that it was carefully designed as a
character encoding form for Unicode that a) used 8-bit code units (thus
making it usable with a vast amount of preexisting software and protocols),
and b) did the mapping from the CCS integers to code units in a clever way,
*so as to minimize the need for additional TES definitions*. In other
words, because of the way the mapping was done into code units, UTF-8 is
already almost entirely Internet protocol safe -- so it can be sent around
without having to package it up in yet another TES.
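The cleverness of that mapping is easy to demonstrate (a small Python sketch): every byte of a multibyte UTF-8 sequence has its high bit set, so no ASCII byte value -- in particular, none of the byte values that protocols hold sacred -- can ever appear except where the corresponding ASCII character was actually encoded.

```python
# U+00E9 (e-acute) encodes as two bytes, both >= 0x80, so neither
# byte can be mistaken for NUL, '/', CR, LF, or any other ASCII value.
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"

# The same holds for every non-ASCII character, of any sequence length:
for ch in ["\u00e9", "\u20ac", "\U00010348"]:   # 2-, 3-, and 4-byte forms
    assert all(b >= 0x80 for b in ch.encode("utf-8"))
```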
However, UTF-8 *is* clearly a character encoding form. It is officially
listed and sanctioned as such by *both* of the standards (using slightly
different terminology, I admit). And it clearly *is* used by many implementations
as a direct, in-memory representation of Unicode characters. It is not an
"encryption" of the character data intended to pass some other protocol -- it
is the character data per se.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT