Re: (Not really?) Unicode question

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Sep 27 2006 - 16:25:35 CST


    On Wed, 27 Sep 2006, Stephane Bortzmeyer wrote:

    > It does not seem to be really Unicode-related. If only "legacy"
    > charsets existed, we would have the same problem. It would disappear
    > only if everyone used Unicode *and* the same encoding.

    That is true, but Unicode has made the problem important. People
    used to send email containing either just ASCII or characters in an
    encoding widely used in their community. Unicode makes us expect
    that we can send email containing all kinds of characters, and we
    are often disappointed.

    > The thread on the KDE bug tracking system seems to be comprehensive
    > and the kmail people replied to all his questions, I believe.

    That sounds fine. Many other problems remain. I'm not sure whether
    a useful collection of information on Unicode email can be compiled, but
    it is certainly needed, at two levels at least: for end users and for IT
    staff. My book "Unicode Explained" tries to discuss both sides, but at a
    fairly general level and with just a few common examples. The status of
    Unicode support in email software and systems, in particular, should be
    documented online, and there's actually something useful-looking in
    Wikipedia, e.g. at
    http://en.wikipedia.org/wiki/Comparison_of_e-mail_clients
    but I'm afraid the inherent instability of Wikipedia pages makes the
    situation problematic. Besides, information on UTF-8 support alone isn't
    really enough.

    Even though most email clients support Unicode at some level, the
    situation is much worse with web-based email use (webmail), which is
    becoming increasingly important. Besides, there are many pitfalls even in
    common email clients. For example: if your Outlook Express is set to
    use ISO-8859-1 by default and you enter characters that exist in Windows
    Latin 1 (windows-1252) but not in ISO-8859-1, it does _not_ automatically
    switch to UTF-8, as it typically does when the default encoding won't do;
    instead it silently replaces the extra characters with some ASCII
    characters. Another example: Thunderbird has a setting for ignoring MIME
    headers and using the default encoding for interpreting all incoming
    messages. This is quite shocking, and it may thoroughly mess things up if
    someone sends you UTF-8 email.
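
    To make the first pitfall concrete, here is a minimal Python sketch of
    the behaviour one would expect (fall back to UTF-8 when the default
    encoding cannot represent the text) next to a lossy substitution. The
    function names are hypothetical, and errors="replace" is only a stand-in
    for silent replacement; it is not how Outlook Express is implemented.

```python
# Sketch only: models the charset choice, not any real client's code.

def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def pick_charset(text, default="iso-8859-1"):
    """Keep the default encoding when it suffices, else fall back to UTF-8."""
    return default if can_encode(text, default) else "utf-8"


def lossy_encode(text, charset="iso-8859-1"):
    """Stand-in for silent replacement: unencodable characters become '?'."""
    return text.encode(charset, errors="replace")


# Curly quotes, euro sign and en dash: in windows-1252, not in ISO-8859-1.
sample = "\u201cPrice: 10 \u20ac \u2013 cheap\u201d"

print(can_encode(sample, "windows-1252"))
print(can_encode(sample, "iso-8859-1"))
print(pick_charset(sample))
print(lossy_encode(sample))
```

    The point of the comparison is that the lossy path gives the sender no
    warning, whereas the fallback path changes only the declared charset.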

    Whether the Unicode Consortium's web site should contain a compilation of
    such information is debatable. If the answer is "yes", I hope it will
    be based on a decision to _maintain_ such information.

    >> - Why is it that some emails in < any foreign script> display
    >> correctly while others just appear as squares and interrogation
    >> marks?
    >
    > Because, in a world where there are many character sets and many
    > encodings, an email MUST be tagged with the proper charset (MIME calls
    > "charset" what is actually an encoding). If it is not properly tagged,
    > it will not be displayed properly. If it is untagged, it depends on
    > some local default (ISO 8859-1 in my case, for instance).

    That's a partial explanation. Non-ASCII email _may_ work without proper
    MIME headers, but of course nobody should rely on that. In practice, even
    end users may need information on how to try to use different encodings
    "manually" (unless they decide to discard email that lacks appropriate
    headers, which is somewhat drastic).
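
    As an illustration of what trying different encodings "manually"
    amounts to, the following Python sketch decodes the same bytes under
    several candidate charsets. The helper name and the candidate list are
    assumptions made for this example; a real client would offer the choice
    through its user interface.

```python
def try_decodings(data, candidates=("utf-8", "iso-8859-1", "windows-1252")):
    """Decode data under each candidate charset; None marks a failure."""
    results = {}
    for charset in candidates:
        try:
            results[charset] = data.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


original = "Gr\u00fc\u00dfe, na\u00efve caf\u00e9"

# An untagged UTF-8 body read with an ISO-8859-1 default turns into mojibake.
utf8_results = try_decodings(original.encode("utf-8"))

# The reverse case: ISO-8859-1 bytes are usually not valid UTF-8 at all.
latin1_results = try_decodings(original.encode("iso-8859-1"))

for charset, text in utf8_results.items():
    print(charset, repr(text))
```

    Note that a wrong single-byte guess rarely raises an error; it just
    produces wrong characters, which is why the reader has to judge the
    result by eye.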

    Even with proper headers, there's the problem that the email may contain
    characters that are not present in the fonts that the email client uses.
    This is no longer very common, but it's certainly possible.

    There are three basic ways to send Unicode email:
    1) as plain text, using a Unicode encoding (UTF-8, in practice) in the
    message body, with appropriate MIME headers
    2) as an attachment, such as an MS Word document
    3) in HTML format (also known as "RTF" in Microsoft documentation!),
    which may mean that non-ASCII characters are represented using character
    references, thereby avoiding some problems.

    All of these have pros and cons. I have seen, for example, method 1 fail
    but method 3 work, surprising as it may sound.
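
    Methods 1 and 3 can be sketched with Python's standard email package.
    This is only an illustration under stated assumptions: the subject
    lines, the helper names and the sample text are invented for the
    example, and method 2 (an attachment) is left out.

```python
import html
from email.message import EmailMessage

text = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac, \u65e5\u672c\u8a9e"


def plain_utf8_message(body):
    """Method 1: plain text in UTF-8, declared in the MIME headers."""
    msg = EmailMessage()
    msg["Subject"] = "Unicode test (plain text)"
    msg.set_content(body, charset="utf-8")
    return msg


def to_char_refs(body):
    """Escape markup, then turn every non-ASCII character into &#N;."""
    escaped = html.escape(body)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def html_ascii_message(body):
    """Method 3: HTML whose body is pure ASCII thanks to character references."""
    msg = EmailMessage()
    msg["Subject"] = "Unicode test (HTML)"
    msg.set_content("<p>" + to_char_refs(body) + "</p>",
                    subtype="html", charset="us-ascii")
    return msg


plain = plain_utf8_message(text)
rich = html_ascii_message(text)
refs = to_char_refs(text)

print(plain["Content-Type"])
print(rich["Content-Type"])
print(refs)
```

    Because the HTML body contains only ASCII, it survives software that
    mishandles 8-bit data or ignores the charset declaration, which may be
    one reason method 3 sometimes works where method 1 fails.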

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

