Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Wed Aug 19 1998 - 05:17:44 EDT


>Resynchronization is trivial. And after having been told the story of
>terminals being brought down by UTF-8 C1 control characters (it's not
>just xterm, DEC VT's are not immune either), I am sure that UTF-8 is
>something that is of so little benefit and causing so much harm that
>its author, obviously a sophisticated player of the bit-game, had
>better not have written it at all. Adaptive UTF-8 seems a curious
>but weird workaround: using heuristics to interpret the high bit is
>just taking the insane bit-game to the next round of insanity.

There are no heuristics, it is just a variation of the rules defining how
the UTF-8 encoding is done.

Is it an insane bit-game? It is an attempt to find a standard way to fit more
character codes into a too small code space. One way is to use a tagged-bit
encoding like UTF-8, another is to use synchronisation codes from
the C0 or C1 control spaces, and a third is to use an escape encoding as
in UTF-7, MIME quoted-printable or the %-encoding in URLs.
I very much dislike escape encodings that use printable characters; control
code escaping is better. The big disadvantage of escape encodings using
printable characters is that even when all characters of a text are in the
basic code space, the escape character itself needs to be encoded.
A quoted-printable text is very difficult to read unless it contains only ASCII.
I would call escape encoding using printable characters insane too.
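
The cost of a printable escape character can be seen with any
quoted-printable encoder. The following is only a small sketch, assuming
Python and its standard quopri module; the sample strings are made up for
illustration. A pure-ASCII line still has its '=' rewritten, and a Swedish
phrase in Latin-1 becomes hard to read.

```python
import quopri

# Pure ASCII, yet the escape character '=' itself must be encoded.
ascii_line = b"x=1"
ascii_encoded = quopri.encodestring(ascii_line)
print(ascii_encoded)

# A Swedish phrase in Latin-1: every non-ASCII byte turns into =XX.
swedish = "Malmö är en stad".encode("latin-1")
encoded = quopri.encodestring(swedish)
print(encoded)

# Decoding restores the original 8-bit bytes.
assert quopri.decodestring(encoded) == swedish
```

A program that does not decode quoted-printable shows the encoded form
as it is.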

>
>What I do care about, and what I think is a strong use case, is that
>software that used 8 bit characters and ISO-Latin should continue to
>work without any change.
I agree.

> This software was not written with Chinese or
>Dingbats in mind, and that's fine. Chinese and Dingbatese people won't
>use it. That's also fine.

That is also true. Too many want to solve everything for everybody.
If you use Chinese, UTF-8 is the wrong encoding to use. UCS-2 would be
better.

>I think, escaping such as done by UTF-7 is the only right way to
>go. It is nothing new, a well understood, a comfortable and extremely
>easy way to expand any character set. Think of the various transfer
>encodings in MIME and pre-MIME encodings (i.e. the one that encoded
>the high bit in a leading caret), and consider German TeX's notation
>of "u for umlaut-u, or TeX in general \"u for u-dieresis and \'e for
>e-acute.

Unfortunately, as I said above, escaping with printable characters means
that a normal character (the escape character) needs to be escaped even
when no other characters need escaping.

>
>I do think there is a strong need for a ``UTF-sane'' specification
>that is 8-bit clean and thus making use of the backwards compatibility
>of Unicode that all other UTFs simply ignore. This spec. can be
>written on one page:
>
>(1) Use 7-bit or 8-bit characters as underlying transport mechanisms
> require.
>(2) Encode all characters out of that range using one escape
> character E and a base 64 sequence, until base 64 ends or
> until we find the escape sequence terminator T.
>(3) For in order to not disturb most normal text, use
> E = T = '~' (asciitilde, ASCII 0x7E, U+007E).
>
>And that's it! All the rest is intro and examples, salt and sugar.

If you want to use a printable escape character, use one that is really
seldom used, for example Latin-1 0xb8 (¸). Tilde is too common a character.

But as many think UTF-8 is the way to go and are adding software that can
read/write it, it ought to be easier to get people to use a slightly
modified version of the read/write routines for UTF-8 (which allows all
Latin-1 and UTF-8 texts produced today to still be used) than to get them
to support a totally new set of routines.
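
To give a feeling for how small such a change to a read routine can be,
here is a rough sketch in Python. It is not the adaptive UTF-8 rules
themselves, which this message does not spell out; it only assumes a
reader that takes well-formed UTF-8 sequences as UTF-8 and any other high
byte as the Latin-1 character with the same value. The names
latin1_fallback and read_text are made up for the example.

```python
import codecs

def _latin1_fallback(err):
    # Called by the UTF-8 decoder for bytes that are not valid UTF-8.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Latin-1 maps each byte to the code point with the same value.
    return bad.decode("latin-1"), err.end

# Hypothetical handler name, registered for this example only.
codecs.register_error("latin1_fallback", _latin1_fallback)

def read_text(data: bytes) -> str:
    """Decode bytes that may be UTF-8 or Latin-1 (illustrative only)."""
    return data.decode("utf-8", errors="latin1_fallback")

print(read_text("Malmö".encode("utf-8")))
print(read_text("Malmö".encode("latin-1")))
```

A fallback like this cannot tell a Latin-1 byte pair that happens to form
a valid UTF-8 sequence (for example 0xC3 0xB6) from the UTF-8 character,
which is exactly the kind of case a precise rule set has to settle.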

>
>I have to blame UTF-7 for constricting itself to the ancient 7-bit
>requirement.
I agree. I protested when it was produced and said that it would work fine
for Latin-1 as well, without any escaping of Latin-1 characters (except for +).
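
Both effects are easy to observe with an existing UTF-7 codec. A minimal
sketch, assuming Python's built-in "utf-7" codec and made-up sample
strings: a literal plus has to be escaped, and a single Latin-1 letter is
turned into a base64 run instead of being sent as one 8-bit byte.

```python
# '+' is UTF-7's escape character, so a literal plus is written as "+-".
plus = "a+b".encode("utf-7")
print(plus)

# A Latin-1 letter is not sent as one 8-bit byte but as a base64 run.
o_umlaut = "ö".encode("utf-7")
print(o_umlaut)

# Both round-trip back to the original text.
assert plus.decode("utf-7") == "a+b"
assert o_umlaut.decode("utf-7") == "ö"
```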

> I am a German living in the U.S., and I know what I am
>talking about when I say that most of the Internet's mail routers are
>now 8-bit clean. ESMTP is available for years now (a decade?) and its
>implemented in sendmail for years. Actually you have to switch ESMTP
>off forcefully in sendmail if you don't want it, isn't it true? So why
>do we hear this constant whining about Internet e-mail not being 8-bit
>clean? Tell your MIME MTA to use transfer-encoding 8bit and try
>it. Unless you don't live behind such insane CC-mail routers, you will
>be pretty happy with it!

And many think quoted-printable is the encoding to use in e-mail instead
of 8-bit mail. Still, simple mail programs like the terminal-based
Unix mailx, mail and whatever they are called now cannot interpret
quoted-printable, which makes text (in my case Swedish) very difficult to read.
Also, as mail is stored in quoted-printable instead of being decoded into
native 8-bit, all text editors and text viewers show the text in encoded
form, making it hard to read.

>
>The other bad choice of UTF-7 was to use the plus (+) as the escape
>character. Plus is a fairly frequent character, especially in
>semi-scientific text, i.e. the most important kind of text encoded and
>transmitted by computers, isn't it true? So, why not using some other
>character. The backslash is normally the character of choice for a
>visible escape character, but thinking of TeX or RTF in UTF-sane makes
>me leave the backslash alone. Tilde is very rarely used, and I really
>don't care if other computers show the ASCII 0x7E as a different
>glyph.
I use ~ quite a lot. It is also unacceptable as an escape character.
As I said above, use a rarely used character from the 128-255 code range
or use a control character (and fix the misbehaving xterm instead: a
text display program should not output control characters unless
they are intended to control the display).
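
The reason UTF-8 text can upset such a terminal is that UTF-8
continuation bytes occupy 0x80-0xBF, which overlaps the C1 control range
0x80-0x9F. A small sketch in Python, assuming the terminal acts on raw
bytes in that range as in the story quoted earlier; the helper name
c1_bytes is made up for the example.

```python
# C1 control codes occupy 0x80-0x9F; UTF-8 continuation bytes occupy 0x80-0xBF.
def c1_bytes(text):
    """Return the UTF-8 bytes of text that fall in the C1 range."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]

# Swedish capitals produce a C1 byte each; the lower-case letters do not.
for ch in "ÅÄÖåäö":
    print(ch, ch.encode("utf-8").hex(), [hex(b) for b in c1_bytes(ch)])
```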

>The good news is: hey, it's going to get much easier than it was
>(despite the general ease of the matter, UTF-8 is pretty hard to
>implement compared with UTF-sane, isn't it?). And to vendors outside
>Unicode we can say, hey, Unicode made itself backwards compatible to
>both 7-bit ASCII and ISO-Latin. So, welcome to the show! We finally
>have come up with a UTF encoding that allows you to leave your
>software untouched unless you want to use the extended power of
>Unicode! You can embark smoothly, stepwise, and you won't miss the
>Unicode train!
>
>And to both kinds of vendors we can say, you can finally talk together
>on the greatest common denominator of 8-bit ISO-Latin! We made sure
>that the old software won't break down when it works together with the
>new software! All your legacy database systems, you can now store
>Japanese person names! You might not be able to display them right,
>but at least you won't get hurt by it and you won't interfere with it!
>That's so much good news, who can reject it with the argument that
>"it's the standard since 93, we just don't want anything new?"

Yes, it would be nice. Much better than today, where too many people think
the only proper way to handle text is to use ASCII (7 bits) and encode
everything else using difficult-to-read encodings, and never bother to make
sure that software never displays the encoded form to human beings.

   Dan

--
Dan Oscarsson
Telia Prosoft AB                       Email: Dan.Oscarsson@trab.se
Box 85
201 20  Malmo, Sweden


