Re: Is there Unicode mail out there?

From: Gaute B Strokkenes (gs234@cam.ac.uk)
Date: Sat Jul 14 2001 - 21:01:10 EDT


On Sat, 14 Jul 2001, dstarner98@aasaa.ofe.org wrote:
>> > How about just supporting these: ISO646-PT, ISO10646-UTF-1,
>> > NATS-SEFI and HP-DeskTop?
>>
>> I'm not sure what you're trying to say here. Assuming these are
>> properly registered charsets, it seems like a very narrow range to
>> support.
>
> Maybe "supporting at least these" would have been a better
> phrasing. They're all valid and registered MIME-charsets. Do you
> know of a single mailer that supports all 4?

OK, I get your point. There are a lot of obscure charsets out there,
and it's probably not necessary to make sure that mail clients
understand all of them since a lot of these have no precedent for use
in email. Nevertheless, there are a number of charsets--ISO-8859-1,
ISO-8859-2, KOI8-R, Shift_JIS and so on--that have widespread
precedent for use in email and are de facto standards for email in
certain languages. It would be extremely foolish to implement a mail
client that understands UTF-8 but not these.

>> If we all had to upgrade our software to do so, I think a lot of
>> people just wouldn't bother.
>
> You're claiming on one hand that everyone's mailer should handle all
> sorts of charsets, and on the other using one that doesn't support
> the only charset that is RFC-mandated for a working mail program to
> support.

I'm sorry, but you're mixing things up a bit. Keep in mind that in
general there is a difference between what processes implementing
Internet protocols should generate and what they are required to
accept. One of the principles that the Internet is founded on is to
"be liberal in what you accept, and conservative in what you produce".

> (Yes, a mailer that doesn't handle UTF-8 violates the appropriate
> RFCs.)

Chapter and verse, please? The only document I could find that puts
forth such a requirement is the one at:

  http://www.imc.org/mail-i18n.html

which is not an RFC. Other than that, there is RFC 2277; however, this
only states that protocols must make it possible to exchange textual
data using UTF-8; it doesn't make it mandatory to understand UTF-8.

RFC 2049 only states that US-ASCII must be understood, and the same
for the ISO-8859-X charsets, except that you're not required to be
able to display the non-ASCII characters they contain. There's no
mention of UTF-8.

If you have any better references, please provide them. (I do not
claim to have encyclopedic knowledge of the subject.)

Note that the IMC document does not encourage mail clients to produce
UTF-8 by default; it only states that mail clients should be able to
interpret it and give users the option to create messages in UTF-8.
It explicitly recognises that few mail clients implemented good
UTF-8 support at the time. That was three years ago, and little has
changed since. It is only very recently that good UTF-8 support has
become standard for new clients, and there are still lots and lots of
old clients that have no UTF-8 support at all. It is certainly clear
that the time scale hinted at in the document (that all mail clients
created or revised after 1 January 1999 should be able to interpret
UTF-8) was hopelessly optimistic. We're not there yet, even though
we're getting closer.

>> It's the closest thing that we have to a common _universal_
>> charset.
>
> You sure? Besides ASCII, what other charset can almost everyone read
> (including the people who cut and paste into Unicode editors,
> because they can read it)? There's no other charset (besides ASCII)
> that everyone with a working mailer, no matter how minimal, can
> read.

Well, I'm saying that UTF-8 / Unicode is the closest thing that we
have to a universal charset. (I meant "universal" as in "universal
character repertoire", not "universally supported".) There are many
charsets that are better supported in general than UTF-8; ASCII and
ISO-8859-1 are two of them.

However, the problem in question is not to choose the "best" charset
in general, but to choose the best possible charset for a given
message containing a given set of characters. RFC 2046 states:

   More generally, if a widely-used character set is a subset of
   another character set, and a body contains only characters in the
   widely-used subset, it should be labelled as being in that subset.
   This will increase the chances that the recipient will be able to
   view the resulting entity correctly.

I think this is good advice. Consider the scenario where a group of
people are accustomed to exchanging email in the language of their
choice in a particular charset with little difficulty. Then some
members of the group upgrade their software, and the other members of
the group can then no longer read their messages, since the new
software insists on using UTF-8 (which the older software does not
support). That's bad, and the above advice avoids this situation.
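
To make the RFC 2046 advice concrete, here is a minimal sketch in
Python of how a sending client might pick the label for an outgoing
body: try a list of charsets in order, from the most widely supported
to the widest repertoire, and use the first one that can encode the
text. The function name pick_charset and the default candidate list
are my own illustrative assumptions, not something taken from any RFC
or from an existing mail client.

```python
# Candidate charsets, ordered from narrowest / most widely supported
# to widest repertoire. This ordering is an illustrative assumption;
# a real client would choose it per locale and user preference.
CANDIDATE_CHARSETS = ["us-ascii", "iso-8859-1", "utf-8"]


def pick_charset(text, candidates=CANDIDATE_CHARSETS):
    """Return the first charset in `candidates` that can encode `text`."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            # This charset lacks at least one character; try the next.
            continue
        return charset
    raise ValueError("no candidate charset can encode this text")


if __name__ == "__main__":
    for sample in ["Hello", "Bj\u00f8rn", "\u041f\u0440\u0438\u0432\u0435\u0442"]:
        print(repr(sample), "->", pick_charset(sample))
```

A group that is used to corresponding in KOI8-R could pass its own
list, for example ["us-ascii", "koi8-r", "utf-8"], so that UTF-8 is
only used when the message contains characters that the established
charset cannot represent.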

-- 
Gaute Strokkenes                        http://www.srcf.ucam.org/~gs234/
I'm thinking about DIGITAL READ-OUT systems and
 computer-generated IMAGE FORMATIONS..


