Re: Is there a UTF that allows ISO 8859-1?

From: Gunther Schadow (gunther@aurora.rg.iupui.edu)
Date: Wed Aug 26 1998 - 11:56:04 EDT


Doug,

I like the general tune of your mail, although it's kind of a sad tune.

> About two years ago, before I had ever heard of any of the UTFs, I
> developed a stateful Unicode transformation format that was almost
> completely Latin-1 compatible. (I say "almost" because ANY such
> encoding must utilize at least one "escape" character, preferably
> from the C0 or C1 range.)
>
> I divided the 16-bit UCS-2 codespace into 256 blocks of 256
> characters each. At any time, exactly one block was "selected in,"
> and the text consisted of characters from that block of 256 Unicode
> characters. The default block was 0x00 but could be changed by
> using an escape character followed by a byte representing the block
> to be selected. I think I chose 0x81 for the escape character,
> since it was unused in Windows CP1252 and the C1 meaning, HIGH OCTET
> PRESET, sounded close to what I was doing with it.
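
A minimal sketch of the scheme described in the quoted text, to make the
byte layout concrete. The quote gives only the outline (256 blocks of 256
characters, default block 0x00, escape byte 0x81 followed by a block
byte), so the function names and the handling of a literal 0x81 are
assumptions: here a character whose low byte is 0x81 is simply rejected.

```python
ESC = 0x81  # escape byte named in the quoted description


def encode(text: str) -> bytes:
    """Encode UCS-2 text as single bytes plus block-switch escapes."""
    out = bytearray()
    block = 0x00  # default block, per the description
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("outside the 16-bit UCS-2 codespace")
        hi, lo = cp >> 8, cp & 0xFF
        if lo == ESC:
            # Assumption: the sketch has no way to write a literal 0x81.
            raise ValueError("low byte collides with the escape byte")
        if hi != block:
            out += bytes((ESC, hi))  # select the new block
            block = hi
        out.append(lo)
    return bytes(out)


def decode(data: bytes) -> str:
    """Reverse of encode(): track the selected block while reading."""
    chars = []
    block = 0x00
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            if i + 1 >= len(data):
                raise ValueError("truncated escape sequence")
            block = data[i + 1]
            i += 2
            continue
        chars.append(chr((block << 8) | b))
        i += 1
    return "".join(chars)
```

Under these assumptions, Latin-1 text without 0x81 encodes to exactly its
Latin-1 bytes, and a character costs three bytes only when it forces a
block switch.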

Your "stateful Unicode transformation format" seems just great to me,
especially because it does not just ease reading of 7-bit ASCII or ISO
Latin-1: you can switch to any 256-character block as the default. For
example, you could switch to Hebrew and send all your Hebrew text
essentially in 8-bit characters! I could imagine this being refined so
that you could virtually overlay two 128-character blocks and thereby
emulate all those US-ASCII + CODE-PAGE-xxx combinations! (I presume
that the other code blocks have been laid out in some way compatible
with such pre-existing code pages.) So that would have been an even
better Unicode encoding! (And I will drop the name UTF-sane for my weak
improvement, because your idea was even more sane!) It is sad, though,
that you simply tossed it after having seen UTF-8.
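
The overlay idea can be sketched roughly as follows. This is only an
illustration under assumptions not spelled out above: bytes 0x00-0x7F
always mean US-ASCII, bytes 0x80-0xFF map onto one selected run of 128
Unicode characters starting at a hypothetical `high_base`, and no escape
mechanism for changing `high_base` in mid-text is shown. The resulting
byte values are not claimed to match any existing code page.

```python
def overlay_encode(text: str, high_base: int) -> bytes:
    """Low half is ASCII; high half is the 128 characters at high_base."""
    if high_base % 0x80:
        raise ValueError("high_base must be a multiple of 0x80")
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)  # ASCII passes through unchanged
        elif high_base <= cp < high_base + 0x80:
            out.append(0x80 + (cp - high_base))  # overlaid half-block
        else:
            raise ValueError(f"U+{cp:04X} is outside both halves")
    return bytes(out)


def overlay_decode(data: bytes, high_base: int) -> str:
    """Reverse of overlay_encode() for the same high_base."""
    return "".join(
        chr(b) if b < 0x80 else chr(high_base + (b - 0x80)) for b in data
    )
```

With `high_base=0x0080` this reproduces Latin-1 byte for byte; with
`high_base=0x0580` the high half covers U+0580 through U+05FF, which
includes the Hebrew letters, so mixed ASCII and Hebrew text needs one
byte per character.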
 
> Admittedly, UCS sort order was not preserved, but for the most part
> this format suited my needs just fine. Every Latin-1 file (that did
> not contain 0x81) was compatible with my format. All characters
> were either 1 or 3 bytes long -- simple! The conversions were truly
> trivial, even when compared to the simple bit-shifting of UTF-8.
> Like UTF-8, no tables were needed to convert to and from UCS-2.
>
> Everything seemed great, but there was one little problem. The
> format was mine alone. Nobody else used it. Nobody else developed
> software to support it. There was no standard, no RFC to proclaim
> that this was the format to be used in databases, e-mail, anything.
> And without any of that, the format was practically useless. As
> soon as I discovered UTF-8, I ditched my proprietary format almost
> immediately.

You should have written the RFC; you should have submitted this as a
competitor to UTF-8 back when it wasn't too late!

> Anyone who wishes to promote a new "UTF-8x" or "UTF-sane" (please,
> not that -- it has the ring of other boastful monikers like "New
O.K.
> Technology" or "V.fast" that quickly become a joke when something
> better comes along) must understand that UTF-8 is already a standard.
> People are already using it; there is software that supports it and
> expects it. A new UTF would not replace the "old" one, but would co-
> exist with it as an incompatible variant. As somebody pointed out,
> this would establish a connection in many people's minds between
> "Unicode" and "a great mess of incompatible 8-bit encodings" and
> they will stick with the 8-bit encodings they already have.

There are two issues here. Having two mutually incompatible UTF
encoding standards is one thing, and it is not that bad if you account
for the benefits that your stateful UTF has in terms of instant
compatibility with non-UTF-aware software. The other thing is the image
of Unicode in "people's minds". Because there is nothing better than
Unicode (since sliced bread) and because you could comply with Unicode
this easily using the non-UTF-8 encoding, it wouldn't hurt too much.

> Several readers have mentioned as their primary complaint with UTF-8
> that their Latin-1-based tools do not handle it well. This may be
> true, but either they or I am missing the point. UTF-8 was not
> designed to be backward-compatible with Latin-1. It doesn't seem

Right, that's why I don't think that UTF-8 was well designed.

> to make sense to judge either the UTF standard or the Latin-1 tools
> against each other. They are different. To carry the example to
> the logical extreme, my ASCII-oriented tools work very badly with
> EBCDIC data, and vice versa.

Oh, that's carrying the example a bit too far. EBCDIC and ASCII are
totally incompatible; as far as I know, there is not a single code
position that both of them have in common. By contrast, Unicode was
apparently designed with backwards compatibility to US-ASCII AND ISO
Latin-1 in mind! The reason why I am so angry about UTF-8 is that its
designers simply ignored this feature of the Unicode design and thought
it would be a great thing merely because it supported backwards
compatibility with US-ASCII.

> The focus ought to be not on developing a new UTF, but on trying to
> develop improved heuristics to distinguish Latin-1 data from UTF-8
> data, and then building that logic into new terminal programs,
> e-mail programs, and so on. In the long run, that will solve
> Gunther's and Dan's problems, and everyone else's as well.
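
One simple heuristic of the kind suggested in the quoted text is to test
whether the bytes form valid UTF-8. The sketch below is an assumption
about what such a check might look like, not a method proposed in this
thread; the function name `guess_encoding` is made up for illustration.
It leans on the observation that Latin-1 text containing accented
characters rarely happens to be well-formed UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8', or 'latin-1' for a byte string."""
    if all(b < 0x80 for b in data):
        return "ascii"  # identical in both encodings, nothing to decide
    try:
        data.decode("utf-8")  # strict decoding rejects malformed input
    except UnicodeDecodeError:
        return "latin-1"  # high bytes that are not well-formed UTF-8
    return "utf-8"
```

The guess can still be wrong: a short Latin-1 string such as the two
bytes 0xC3 0xA9 is also valid UTF-8, so a real program would want more
context than this check alone.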

But this requires that you have your hands on the source code! And
that's difficult for the most part; I guess you all know that. In my
(Unix) world, it is no big deal, and I think I could clean up my
personal workstation to use Unicode exclusively. But I am not talking
about my personal software and hobby; I am talking about business
empires where programmers are hidden from the public behind
meter-thick walls, guarded by legions of salespeople who don't see
much difference between a vacuum cleaner and a computer program.

I was close to giving up on a better UTF, but now that I have heard
about your encoding and added the code block overlay technique to it,
I want even more to work out something that's better than UTF-8. Now
that UTF-sane (sorry, I agreed with you above not to use UTF-sane, but
it is a codename for an ongoing project; it will be changed to
something more durable when the time has come) ... now that UTF-sane
is no longer dedicated to Latin-1 compatibility, chances are I could
even get non-western-Europeans on board.

regards
-Gunther

Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>


