Re: (Not really?) Unicode question

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Sep 27 2006 - 16:25:35 CST

Next message: Andrew West: "Re: non-IPA primary/secondary stress marks?"

Previous message: Jefsey_Morfin: "Re: Unicode & space in programming & l10n"
In reply to: Stephane Bortzmeyer: "Re: (Not really?) Unicode question"

On Wed, 27 Sep 2006, Stephane Bortzmeyer wrote:

> It does not seem to be really Unicode-related. If only "legacy"
> charsets existed, we would have the same problem. It would disappear
> only if everyone used Unicode *and* the same encoding.

That is true, but Unicode has made the problem important. People
used to send email containing either just ASCII or characters
in an encoding widely used in some community. Unicode makes us expect
that we can send email containing all kinds of characters, and we
often get disappointed.

> The thread on the KDE bug tracking system seems to be comprehensive
> and the kmail people replied to all his questions, I believe.

That sounds fine. Many other problems remain. I'm not sure whether
a useful collection of information on Unicode email can be composed, but it
would surely be needed, at two levels at least: for end users and for
IT staff. My book "Unicode Explained" tries to discuss
both sides, but at a fairly general level and with just a few common
examples. In particular, the status of Unicode support in email software and
systems is something that should be documented in an online document, and
there's actually something useful-looking in Wikipedia, e.g. at
http://en.wikipedia.org/wiki/Comparison_of_e-mail_clients
but I'm afraid the inherent instability of Wikipedia pages makes it
problematic as a reference. Besides, information on UTF-8 support alone
isn't really enough.

Even though most email clients support Unicode at some level, the
situation is much worse with web-based email use (webmail), which is
becoming increasingly important. Besides, there are many pitfalls even in
common email clients. For example, if your Outlook Express is set to
use ISO-8859-1 by default and you enter characters that are in Windows
Latin 1 (windows-1252) but not in ISO-8859-1, it does _not_ automatically
switch to UTF-8, as it typically does when the default encoding won't do;
instead it silently replaces the extra characters with ASCII characters.
Another example: Thunderbird has a setting for ignoring MIME headers and
using the default encoding for interpreting all incoming messages. This is
quite shocking, and it may thoroughly mess things up if someone sends you
UTF-8 email.
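
To make the Outlook Express pitfall concrete, here is a minimal sketch in
Python. It is not a description of what any email client does internally;
the function name pick_charset and the sample string are made up for
illustration. It contrasts silent substitution with the safer behavior of
switching to UTF-8 when the preferred encoding cannot represent the text.

```python
def pick_charset(text, preferred="iso-8859-1"):
    """Return the narrowest charset that encodes text without loss."""
    for charset in ("us-ascii", preferred, "utf-8"):
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    raise ValueError("text cannot be encoded (lone surrogates?)")


# Euro sign, en dash and curly quotes exist in windows-1252
# but not in ISO-8859-1.
sample = "Price: 10 \u20ac \u2013 \u201cspecial\u201d"

# Silent substitution, the behavior to avoid: each such character
# is replaced by "?".
lossy = sample.encode("iso-8859-1", errors="replace")

# Switching the whole message to UTF-8 keeps the text intact.
chosen = pick_charset(sample)
safe = sample.encode(chosen)
```

The chosen charset name is what would then have to go into the MIME
Content-Type header of the message.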

Whether the Unicode Consortium's web page should contain a compilation of
such information is debatable. If the answer is "yes", I hope it will
be based on a decision to _maintain_ such information.

>> - Why is it that some emails in < any foreign script> display
>> correctly while others just appear as squares and interrogation
>> marks?
>
> Because, in a world where there are many character sets and many
> encodings, an email MUST be tagged with the proper charset (MIME calls
> "charset" what is actually an encoding). If it is not properly tagged,
> it will not be displayed properly. If it is untagged, it depends on
> some local default (ISO 8859-1 in my case, for instance).

That's a partial explanation. Non-ASCII email _may_ work without proper
MIME headers, but of course nobody should rely on that. In practice, even
end users may need information on how to try to use different encodings
"manually" (unless they decide to discard email that lacks appropriate
headers, which is somewhat drastic).
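
Trying encodings "manually" amounts to something like the following
Python sketch. The function name decode_candidates and the list of
candidate charsets are my assumptions, chosen only for illustration;
a real client offers this as a menu choice. The point is that some wrong
guesses fail outright, while others "succeed" and produce garbage, so a
human still has to look at the result.

```python
def decode_candidates(raw, charsets=("utf-8", "iso-8859-1", "koi8-r")):
    """Decode raw bytes under each candidate charset.

    Returns a dict mapping charset name to the decoded text,
    or to None when the bytes are not valid in that charset.
    """
    results = {}
    for charset in charsets:
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


# "Jyväskylä" sent as UTF-8 but without a charset label.
untagged_utf8 = "Jyv\u00e4skyl\u00e4".encode("utf-8")
guesses = decode_candidates(untagged_utf8)

# The same word sent as ISO-8859-1 is not valid UTF-8 at all.
untagged_latin1 = "Jyv\u00e4skyl\u00e4".encode("iso-8859-1")
guesses_latin1 = decode_candidates(untagged_latin1)
```

Here the ISO-8859-1 reading of the UTF-8 bytes yields two characters
(A with tilde, then the currency sign) for each "a with diaeresis", which
is the typical symptom of a wrong default.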

Even with proper headers, there's the problem that the email may contain
characters that are not present in the fonts that the email client uses.
This is no longer very common, but it's certainly possible.

There are three basic ways to send Unicode email:
1) as plain text, using a Unicode encoding (UTF-8, in practice) in the
message body, with appropriate MIME headers
2) as an attachment, such as an MS Word document
3) in HTML format (also known as "RTF" in Microsoft documentation!),
which may mean that non-ASCII characters are represented using character
references, thereby avoiding some problems.

All of these have pros and cons. I have seen, for example, method 1 fail
but method 3 work, surprising as it may sound.
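
Methods 1 and 3 can be sketched with Python's standard email package as
follows. This is only an illustration under my own assumptions (the
subject lines and sample text are invented), not how any particular
client composes messages. It shows why method 3 can survive where
method 1 fails: with character references, the HTML body is pure ASCII.

```python
from email.message import EmailMessage

# Finnish and Greek greetings, written with escapes to keep this
# source ASCII-only.
text = "Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4, \u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"

# Method 1: plain text in UTF-8, labeled as such in the MIME headers.
plain = EmailMessage()
plain["Subject"] = "Unicode test (plain)"
plain.set_content(text, charset="utf-8")

# Method 3: HTML in which every non-ASCII character is written as a
# numeric character reference such as &#228;
html_body = "<p>" + text.encode("ascii", errors="xmlcharrefreplace").decode("ascii") + "</p>"
html = EmailMessage()
html["Subject"] = "Unicode test (HTML)"
html.set_content(html_body, subtype="html", charset="us-ascii")
```

The plain message depends on every system along the way honoring the
charset label; the HTML message depends only on the recipient's client
rendering HTML and having suitable fonts.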

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
