Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Gunther Schadow (gunther@aurora.rg.iupui.edu)
Date: Tue Aug 18 1998 - 13:49:23 EDT


Thanks to all who took the time to reply to my question. I read it
all and I have a bad feeling in my belly now. So that's why I get
somewhat wild later in the text. I appreciate comments and being
defeated, but I really do think that the last word has not been spoken
on this. I also appreciate those who think that this has been
discussed enough to send me the old threads. I just don't think it is
resolved no matter how often it has been discussed.

Some seemingly smart guy (no irony here, who was it?) wrote:
>> If you really need a Latin-1 compatible UTF, then just use UTF-7 but
>> do not transform the characters in the 0x80-0xff range. This is a
>> straight forward modification of UTF-7 and it costs you just one or
>> two bytes to change in an UTF-7 implementation. This technique is so
>> obvious and trivial that it is not even worth to write a formal
>> specification for it.
>>
>> I hope it will not become popular. Another UCS encoding is
>> certainly not what the world has been waiting for.

to which Dan Oscarsson replied:
> I agree, UTF-7 could be possible but is not wanted. My adaptive
> UTF-8 is really UTF-8, just that the software accepts not UTF-8
> encoding sequences when reading and using iso 8859-1, if possible,
> when writing. Could easily be incorporated into existing UTF-8
> software.

I disagree, especially with the passive phrase "is not wanted": who
does not want it? The UTF-7 specification is one of those bad
specifications that take a straightforward idea, a simple and very
effective solution, and restrict it into unnecessary speciality
instead of leaving it powerful and general. I also disagree with the
notion that, yeah, we have to live with bad standards.

Some said I should not overestimate the benefit of a UTF-x encoding,
some said that UTF-8 has such fine properties as sort- and
grep-ability and resynchronization. I am not at all interested in
resynchronization or whether I can sort some UTF-x encoded Unicode
with UNIX sort. And of course I can grep for UTF-7 escape sequences.

As the Unicode manual says somewhere in the front: sorting is not an
issue of the character set but of cultural conventions and a matter of
style. In German, some sort the umlauts before the non-umlaut vowel,
some sort them after transcribing to vowel-e (e.g., ae, oe, ue), and
others sort without regard to plain vowel or umlaut vowel. In
Spanish, ll is sorted as one letter rather than two ells. Japanese can
be sorted according to the Iroha order or the canonical Kana table
(with Japanese ``umlauts'' (e.g., ka-ga, ha-ba-pa) inlined or
separated), and some even sort by roman transcription. BTW, ISO-Latin
sorts in a completely wrong order. Sorting is simply not an issue
of the character set or its encoding.
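
To make the German case concrete, here is a minimal Python sketch of
three of those orderings as sort keys. The sample words and the helper
names are hypothetical, chosen only for illustration; real collation
rules are richer than this.

```python
# Hypothetical helpers: three ways to order the same German words.
TRANSCRIBE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})
STRIP = str.maketrans("äöüÄÖÜ", "aouAOU")


def by_code_point(word):
    """Raw code point order, which is what ISO-Latin byte order gives."""
    return word


def by_transcription(word):
    """Sort after transcribing umlauts to vowel-e (ae, oe, ue)."""
    return word.translate(TRANSCRIBE).casefold()


def ignoring_umlauts(word):
    """Sort without regard to plain vowel or umlaut vowel."""
    return word.translate(STRIP).casefold()


words = ["Zahl", "Ärger", "Apfel"]
print(sorted(words, key=by_code_point))     # ['Apfel', 'Zahl', 'Ärger']
print(sorted(words, key=by_transcription))  # ['Ärger', 'Apfel', 'Zahl']
print(sorted(words, key=ignoring_umlauts))  # ['Apfel', 'Ärger', 'Zahl']
```

Same character set, same encoding, three different results: the order
comes from the key function, not from the bytes.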

Resynchronization is trivial. And after having been told the story of
terminals being brought down by UTF-8 C1 control characters (it's not
just xterm, DEC VTs are not immune either), I am sure that UTF-8 is
something that is of so little benefit and causes so much harm that
its author, obviously a sophisticated player of the bit-game, had
better not have written it at all. Adaptive UTF-8 seems a curious
but weird workaround: using heuristics to interpret the high bit is
just taking the insane bit-game to the next round of insanity.
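
The C1 problem is easy to reproduce. The following Python sketch (the
function name is hypothetical) lists the bytes of a string that an
8-bit ISO 8859-1 device would read as C1 controls, i.e. bytes in the
range 0x80-0x9F. The sample characters are my own picks.

```python
def c1_bytes(data):
    """Return the bytes in 0x80-0x9F, the C1 control range of ISO 8859-1."""
    return [b for b in data if 0x80 <= b <= 0x9F]


for ch in ["Ä", "Û", "é"]:
    utf8 = ch.encode("utf-8")
    latin1 = ch.encode("latin-1")
    print(repr(ch),
          "utf-8:", " ".join(f"{b:02x}" for b in utf8),
          "C1 in utf-8:", [hex(b) for b in c1_bytes(utf8)],
          "C1 in latin-1:", [hex(b) for b in c1_bytes(latin1)])
```

For example, U+00DB becomes the two bytes c3 9b in UTF-8, and 0x9B is
CSI, the 8-bit control sequence introducer. The same character in
ISO 8859-1 is the single byte 0xDB and contains no C1 byte at all.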

What I do care about, and what I think is a strong use case, is that
software that used 8 bit characters and ISO-Latin should continue to
work without any change. This software was not written with Chinese or
Dingbats in mind, and that's fine. Chinese and Dingbatese people won't
use it. That's also fine. But of what tremendous use is a UTF (I call
it ``UTF-sane'') that allows any prior software (ASCII 7-bit and
ISO-Latin 8-bit) to continue to work IN ITS REALM, unchanged? This
software would be made Unicode/UTF-sane conformant without any change
to the software! That does not mean that the old software will
magically display or sort Chinese right (UTF-8 does not sort by
stroke-index either, does it?), but that's not a problem. At least it
should not choke or break down on random C1 control characters such as
sent by UTF-8.

I think escaping, such as done by UTF-7, is the only right way to
go. It is nothing new: a well-understood, comfortable, and extremely
easy way to expand any character set. Think of the various transfer
encodings in MIME and pre-MIME encodings (e.g., the one that encoded
the high bit in a leading caret), and consider German TeX's notation
of "u for umlaut-u, or TeX in general with \"u for u-dieresis and \'e
for e-acute.

I do think there is a strong need for a ``UTF-sane'' specification
that is 8-bit clean and thus makes use of the backwards compatibility
of Unicode that all other UTFs simply ignore. This spec can be
written on one page:

(1) Use 7-bit or 8-bit characters as the underlying transport
    mechanisms require.
(2) Encode all characters outside that range using one escape
    character E and a base 64 sequence, until base 64 ends or
    until we find the escape sequence terminator T.
(3) In order not to disturb most normal text, use
        E = T = '~' (asciitilde, ASCII 0x7E, U+007E).
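
A minimal Python sketch of the 8-bit variant of these three rules
follows. The rules leave several details open, so the sketch makes
assumptions that are mine and not part of the proposal: the base 64
payload carries UTF-16BE code units (borrowed from UTF-7), base 64
padding is dropped, every escape sequence is closed by an explicit
'~', and a literal tilde is written as an empty escape '~~' (by
analogy with UTF-7's '+-'). The function names are hypothetical.

```python
import base64

ESC = b"~"  # E = T = '~' (0x7E), per rule (3)


def encode_utf_sane(text):
    """Encode str to bytes: Latin-1 passes through, the rest is escaped."""
    out = bytearray()
    run = []  # pending characters above U+00FF

    def flush():
        if run:
            raw = "".join(run).encode("utf-16-be")       # assumption
            b64 = base64.b64encode(raw).rstrip(b"=")     # assumption: no padding
            out.extend(ESC + b64 + ESC)
            run.clear()

    for ch in text:
        if ord(ch) > 0xFF:
            run.append(ch)
            continue
        flush()
        if ch == "~":
            out.extend(b"~~")  # assumption: empty escape means literal tilde
        else:
            out.append(ord(ch))
    flush()
    return bytes(out)


def decode_utf_sane(data):
    """Inverse of encode_utf_sane; raises ValueError on an unterminated escape."""
    out = []
    i = 0
    while i < len(data):
        if data[i] != 0x7E:
            out.append(chr(data[i]))  # plain ISO 8859-1 byte
            i += 1
            continue
        end = data.index(ESC, i + 1)
        b64 = data[i + 1:end]
        if not b64:
            out.append("~")
        else:
            pad = b"=" * (-len(b64) % 4)
            out.append(base64.b64decode(b64 + pad).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


print(encode_utf_sane("Grüße"))      # b'Gr\xfc\xdfe'
print(encode_utf_sane("Name: 日本"))  # b'Name: ~ZeVnLA~'
```

Pure ISO 8859-1 text comes out byte for byte as it went in, apart from
the doubled tilde, and anything else appears to old software as a short
run of printable ASCII between two tildes.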

And that's it! All the rest is intro and examples, salt and sugar.

I have to blame UTF-7 for constricting itself to the ancient 7-bit
requirement. I am a German living in the U.S., and I know what I am
talking about when I say that most of the Internet's mail routers are
now 8-bit clean. ESMTP has been available for years now (a decade?)
and it has been implemented in sendmail for years. Actually you have
to switch ESMTP off forcefully in sendmail if you don't want it, isn't
that true? So why do we hear this constant whining about Internet
e-mail not being 8-bit clean? Tell your MIME MTA to use
transfer-encoding 8bit and try it. Unless you live behind one of those
insane CC-mail routers, you will be pretty happy with it!
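
For anyone who wants to try it, here is a small sketch using Python's
standard email package to build a message with ISO 8859-1 text and
Content-Transfer-Encoding 8bit. The addresses and the text are
placeholders. The sketch only builds the message; whether the bytes
survive the trip still depends on every hop being 8-bit clean.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.org"       # placeholder address
msg["To"] = "recipient@example.org"      # placeholder address
msg["Subject"] = "8bit test"
# Ask for ISO 8859-1 text sent as raw 8-bit bytes, not quoted-printable.
msg.set_content("Grüße aus Indianapolis\n", charset="iso-8859-1", cte="8bit")

wire = msg.as_bytes()
print(msg["Content-Transfer-Encoding"])
print(b"Gr\xfc\xdfe" in wire)
```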

The other bad choice of UTF-7 was to use the plus (+) as the escape
character. Plus is a fairly frequent character, especially in
semi-scientific text, i.e. the most important kind of text encoded and
transmitted by computers, isn't that true? So why not use some other
character? The backslash is normally the character of choice for a
visible escape character, but thinking of TeX or RTF in UTF-sane makes
me leave the backslash alone. Tilde is very rarely used, and I really
don't care if other computers show the ASCII 0x7E as a different
glyph. All I am interested in is to reserve 0x7E as the escape
character. Of course there is no reason why we have to use a different
character to end the base64 sequence.
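
The cost of the plus sign can be seen with the UTF-7 codec that ships
with Python. This is a small sketch, and the helper name and the sample
strings are hypothetical: every literal '+' in otherwise plain ASCII
text has to be written as '+-'.

```python
def utf7_overhead(text):
    """Extra bytes UTF-7 needs compared with one byte per character."""
    return len(text.encode("utf-7")) - len(text)


for sample in ["a+b", "x + y + z", "no plus here"]:
    print(sample, "->", sample.encode("utf-7"), "overhead:", utf7_overhead(sample))
```

Note that RFC 2152 also leaves the tilde and the backslash out of the
characters UTF-7 may send directly, so a tilde escape would collide
with nothing that UTF-7 already treats as plain.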

I'd like to apologize for being wild here. I'd like to hear from
anyone who agrees with my propositions (and with my anger about the
current state of affairs). I'd like to fix things. I'd also like
to hear counter-arguments. But I warn you, "it's the standard since
'93" does not count as an argument. I am active in another standards
organization and we are all about to spend tremendous effort to fix
our past ten years' insanity and come up with a much improved new
approach. I have a 5MB (source) software package that will be rendered
almost obsolete by the new approach, so I do have some feeling for
what is at stake in extending a standard. I also know what it means to
take care of the vendors and the industry that stood behind the old
approach.

The good news is: hey, it's going to get much easier than it was
(despite the general ease of the matter, UTF-8 is pretty hard to
implement compared with UTF-sane, isn't it?). And to vendors outside
Unicode we can say, hey, Unicode made itself backwards compatible to
both 7-bit ASCII and ISO-Latin. So, welcome to the show! We finally
have come up with a UTF encoding that allows you to leave your
software untouched unless you want to use the extended power of
Unicode! You can embark smoothly, stepwise, and you won't miss the
Unicode train!

And to both kinds of vendors we can say, you can finally talk together
on the greatest common denominator of 8-bit ISO-Latin! We made sure
that the old software won't break down when it works together with the
new software! All your legacy database systems, you can now store
Japanese person names! You might not be able to display them right,
but at least you won't get hurt by it and you won't interfere with it!
That's so much good news, who can reject it with the argument that
"it's the standard since '93, we just don't want anything new"?

regards
-Gunther

Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>


