There is little question that UTF-8 is more "reader-friendly" for
languages such as English, which uses mostly 7-bit ASCII characters,
than for languages that use more characters outside the ASCII zone.
This is an advantage IF the goal is to make the text human-readable.
As Ken Whistler points out, however, there are other reasons beyond
this for using UTF-8.
About two years ago, before I had ever heard of any of the UTFs, I
developed a stateful Unicode transformation format that was almost
completely Latin-1 compatible. (I say "almost" because ANY such
encoding must utilize at least one "escape" character, preferably
from the C0 or C1 range.)
I divided the 16-bit UCS-2 codespace into 256 blocks of 256
characters each. At any time, exactly one block was "selected in,"
and the text consisted of characters from that block of 256 Unicode
characters. The default block was 0x00 but could be changed by
using an escape character followed by a byte representing the block
to be selected. I think I chose 0x81 for the escape character,
since it was unused in Windows CP1252 and the C1 meaning, HIGH OCTET
PRESET, sounded close to what I was doing with it.
In this encoding, any Latin-1 (but not CP1252) file could be
represented directly, since the default block was 0x00. A string
such as "Walesa" in its true Polish spelling (assuming I have it
right :) would be represented like this:
Char      W       a       l-stroke  e-ogonek  s         a
Unicode   U+0057  U+0061  U+0142    U+0119    U+0073    U+0061
Doug's    57      61      81 01 42  19        81 00 73  61
Of course, the character U+0081 itself could be represented by
81 00 81 if necessary.
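The scheme as described can be sketched in Python (a sketch, not Doug's original code; the function names are mine, and I am inferring from the "Walesa" example that the escape sequence is escape byte + block byte + character byte, which is what makes every character either 1 or 3 bytes long):

```python
ESC = 0x81  # escape byte chosen because it is unused in CP1252

def encode(text: str) -> bytes:
    """Stateful block encoding: 1 byte in the current block, else ESC+block+char."""
    out = bytearray()
    block = 0x00  # default selected block
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("format covers UCS-2 (the BMP) only")
        hi, lo = cp >> 8, cp & 0xFF
        if hi == block and lo != ESC:
            out.append(lo)                 # 1-byte form: character in current block
        else:
            out += bytes((ESC, hi, lo))    # 3-byte form: select block, then character
            block = hi                     # the new block stays selected (stateful)
    return bytes(out)

def decode(data: bytes) -> str:
    """Inverse transformation: ESC introduces a block byte plus a literal character byte."""
    chars = []
    block = 0x00
    it = iter(data)
    for b in it:
        if b == ESC:
            block = next(it)  # block-select byte
            b = next(it)      # literal character byte (may itself be 0x81)
        chars.append(chr(block << 8 | b))
    return "".join(chars)
```

Note that any character whose low byte happens to be 0x81 must take the 3-byte form, since a bare 0x81 always means "escape" -- which is why at least one escape character is unavoidable in a scheme like this.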
Admittedly, UCS sort order was not preserved, but for the most part
this format suited my needs just fine. Every Latin-1 file (that did
not contain 0x81) was compatible with my format. All characters
were either 1 or 3 bytes long -- simple! The conversions were truly
trivial, even when compared to the simple bit-shifting of UTF-8.
Like UTF-8, no tables were needed to convert to and from UCS-2.
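For comparison, the "simple bit-shifting" that UTF-8 conversion amounts to can be sketched like this for BMP code points (an illustration only, not production code; it ignores surrogates and ill-formed input):

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode one BMP code point as UTF-8 by masking and shifting -- no tables needed."""
    if cp < 0x80:                                  # 7-bit ASCII: 1 byte, unchanged
        return bytes([cp])
    if cp < 0x800:                                 # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    return bytes([0xE0 | cp >> 12,                 # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])
```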
Everything seemed great, but there was one little problem. The
format was mine alone. Nobody else used it. Nobody else developed
software to support it. There was no standard, no RFC to proclaim
that this was the format to be used in databases, e-mail, anything.
And without any of that, the format was practically useless. As
soon as I discovered UTF-8, I ditched my proprietary format almost
immediately.
Anyone who wishes to promote a new "UTF-8x" or "UTF-sane" (please,
not that -- it has the ring of other boastful monikers like "New
Technology" or "V.fast" that quickly become a joke when something
better comes along) must understand that UTF-8 is already a standard.
People are already using it; there is software that supports it and
expects it. A new UTF would not replace the "old" one, but would co-
exist with it as an incompatible variant. As somebody pointed out,
this would establish a connection in many people's minds between
"Unicode" and "a great mess of incompatible 8-bit encodings" and
they will stick with the 8-bit encodings they already have.
Several readers have mentioned, as their primary complaint with
UTF-8, that their Latin-1-based tools do not handle it well. This may be
true, but either they or I am missing the point. UTF-8 was not
designed to be backward-compatible with Latin-1. It doesn't seem
to make sense to judge either the UTF standard or the Latin-1 tools
against each other. They are different. To carry the example to
the logical extreme, my ASCII-oriented tools work very badly with
EBCDIC data, and vice versa.
The focus ought to be not on developing a new UTF, but on trying to
develop improved heuristics to distinguish Latin-1 data from UTF-8
data, and then building that logic into new terminal programs,
e-mail programs, and so on. In the long run, that will solve
Gunther's and Dan's problems, and everyone else's as well.
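One heuristic of the kind I have in mind could be as simple as this (a sketch under my own assumptions, not something from any standard): the continuation-byte structure of UTF-8 means that non-ASCII Latin-1 text almost never forms valid UTF-8 by accident, so a strict decode attempt already separates the two reasonably well.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8.

    Latin-1 bytes in the 0x80-0xFF range rarely happen to line up as
    valid UTF-8 lead + continuation bytes, so a failed strict decode
    is a strong hint the data is Latin-1 (or some other 8-bit code).
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Pure ASCII data passes the test trivially, of course, since it is valid under both interpretations; a real terminal or mail program would also want to weigh in other evidence, such as declared charsets.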
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:41 EDT