Re: Emoji [And crash in the Web interface to the mailing list] from Philippe Verdy on 2014-04-04 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 4 Apr 2014 07:45:08 +0200

The content is transferred as UTF-8 at the MIME level for both the
plain-text and HTML parts attached:

--_000_76A14357762B4A06BD1E68CB3C8C452Dgluesoftcojp_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

...

--_000_76A14357762B4A06BD1E68CB3C8C452Dgluesoftcojp_
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: base64

...

Normally the XML declaration or meta tag in the (X)HTML headers should be
ignored: mail agents will not transform the attachments, except possibly
by changing the content-transfer-encoding (here base64 in both parts). If
your mail agent does not process the HTML part, it will render the
plain-text version, which has no declaration at all; the MIME content-type
is then the only indication.
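
To illustrate the point above, here is a minimal sketch using Python's
standard email package. The boundary name "demo", the sample text and the
helper names are made up for this example; it only shows that the charset
comes from the MIME part header, and that the misleading "us-ascii"
declaration inside the HTML part plays no role in the decoding. Falling
back to UTF-8 when a part declares no charset is an assumption of the
sketch, not something a given mail agent is known to do.

```python
import base64
from email import message_from_string


def b64(text):
    """Encode text as UTF-8, then as base64 (ASCII) for a MIME body."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical sample content: mostly ASCII with two non-ASCII characters.
PLAIN = "(\u30fb_\u30fb?) - confused"
# HTML part carrying a misleading in-document encoding declaration.
HTML = '<?xml version="1.0" encoding="us-ascii"?><p>' + PLAIN + "</p>"

RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="demo"\n'
    "\n"
    "--demo\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + b64(PLAIN) + "\n"
    "--demo\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + b64(HTML) + "\n"
    "--demo--\n"
)


def decode_text_parts(raw_message):
    """Return (content type, charset, decoded text) for each leaf part."""
    results = []
    for part in message_from_string(raw_message).walk():
        if part.is_multipart():
            continue
        # Charset as declared in the MIME header (lowercased), not in the body.
        charset = part.get_content_charset() or "utf-8"  # fallback is an assumption
        payload = part.get_payload(decode=True)  # undoes the base64 layer
        results.append((part.get_content_type(), charset, payload.decode(charset)))
    return results


parts = decode_text_parts(RAW)
```

Both parts decode correctly from the MIME charset alone, even though the
HTML part still contains its "us-ascii" declaration as plain text.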

I don't know how the sending email agent could generate "us-ascii" in the
XHTML headers, but in fact it should simply be discarded in all cases (in
HTML5, us-ascii, iso-8859-1 and their aliases are normally all treated
like windows-1252, so a "us-ascii" declaration is effectively ignored; it
is wrong in almost all cases anyway).

But here we are not in an HTML5 case; so once the HTML headers are
discarded, the next candidate is the MIME part declaration (the
transporting layer). As it specifies UTF-8, this should work without
forcing the reading email agent to start using its "encoding guessing"
magic. But even in this case, UTF-8 is certainly a better guess than legacy
Japanese charsets (changing various settings in the reading mail agent,
such as specifying a preferred default encoding to use one of these legacy
charsets, has no effect: UTF-8 is always used to process the message).

So those who see the bug are affected if:
- the sender composes and sends the message with an outdated and buggy
email agent (which inserts completely incorrect XHTML headers)
- the recipients themselves use an outdated and buggy email agent that does
not perform the most reasonable processing and guessing steps (or this
behavior can only be reproduced by those using an email agent whose user
localisation is Japanese).

The encoding guesser here is most probably buggy, but it is also affected
by the fact that there is not enough contextual content to guess the
encoding with good confidence (only a few isolated characters whose use
here was discretionary and extremely rare in English text content; those
few characters have a near-zero confidence value in English as long as no
other East Asian language is used).
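
As a side note, even when statistical guessing has little to work with,
strict UTF-8 validation remains a cheap and fairly strong signal. The
sketch below is only an illustration of that idea, not a description of
what the agent in question does; the function name and the sample strings
are invented for the example.

```python
def looks_like_utf8(data):
    """Return True if data contains non-ASCII bytes and is valid strict UTF-8."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but it tells us nothing about the encoding.
    return any(ord(ch) > 0x7F for ch in text)


# Mostly English text with two non-ASCII characters, encoded as UTF-8.
english_with_emoticon = "(\u30fb_\u30fb?) - confused".encode("utf-8")
# Japanese text ("nihongo") in a legacy encoding, for comparison.
legacy_japanese = "\u65e5\u672c\u8a9e".encode("shift_jis")

utf8_ok = looks_like_utf8(english_with_emoticon)
legacy_ok = looks_like_utf8(legacy_japanese)
```

The UTF-8 sample validates even though it holds only two non-ASCII
characters, while the Shift-JIS bytes are rejected by the strict decoder.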

It looks like the reading email agent does not reach a minimum threshold
level of confidence for the guessed encoding; so it seems that the result of
the guesser is simply discarded, and then the reading email agent only uses
the default user setting for the encoding used to process messages with
unknown/unspecified encodings. I'm not sure it is valid to discard the
explicit UTF-8 MIME declaration, which does not come from the encoding
guesser, as UTF-8 is now a solid default (it has long been the default for
almost all new IETF standards, and a wide majority of software
installations now effectively use it as their default), notably when it is
specified, as it is here.
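
The order of precedence argued for here can be sketched as follows. This
is only an illustration under stated assumptions: the function name, the
(name, confidence) shape of the guess and the 0.5 threshold are all
hypothetical and are not taken from any particular mail agent.

```python
import codecs


def choose_encoding(mime_charset, guess, user_default, threshold=0.5):
    """Pick an encoding: explicit MIME charset, then a confident guess,
    then the user's configured default.

    guess is a hypothetical (encoding_name, confidence) pair, or None.
    """
    if mime_charset:
        try:
            # An explicit, recognised declaration wins outright.
            return codecs.lookup(mime_charset).name
        except LookupError:
            pass  # unknown label: treat the part as undeclared
    if guess is not None and guess[1] >= threshold:
        return guess[0]
    # Only reached when nothing was declared and the guess was too weak.
    return user_default


declared = choose_encoding("UTF-8", ("shift_jis", 0.1), "iso-2022-jp")
undeclared_weak = choose_encoding(None, ("shift_jis", 0.1), "iso-2022-jp")
undeclared_strong = choose_encoding(None, ("shift_jis", 0.9), "iso-2022-jp")
```

With this ordering a weak guess never overrides the declared UTF-8; the
user default only applies when the part declares nothing usable.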

We know that UTF-8 is now the best guess for content at the *worldwide*
level. But is UTF-8 still a minority encoding for content exchanged in
Japan? ISO-2022-JP seems very unlikely to be used instead of UTF-8, and
I would rather have expected a Shift-JIS variant, if Unicode is still not
the best choice for Japan. But if the email agent runs on a now antique OS
(Windows XP or 2000, themselves installed in their Japanese localisation),
maybe that user never updated the agent for that old OS (which is quite
surprising for Japan, which likes to use the newest technology products,
unless the reading user has a tricky installation with lots of personal
system settings for "geek" tools that have never been ported to newer
OSes).

In my opinion we are in an extremely user-specific situation. But I do not
see where the mailing list was acting incorrectly (it won't change its
settings only for a few "geeks" with tricky installations and antique
software).

2014-04-04 6:48 GMT+02:00 Koji Ishii <kojiishi_at_gluesoft.co.jp>:

> Go to Encoding menu and choose UTF-8 to fix the garbled characters.
>
> It looks like the page is served in UTF-8, but it declares itself as
> us-ascii:
> <?xml version="1.0" encoding="us-ascii"?>
> and
> <meta http-equiv="Content-Type" content="text/html; charset=us-ascii" />
>
> /koji
>
> On Apr 4, 2014, at 8:16 AM, Buck Golemon <buck_at_yelp.com> wrote:
>
> I too received the intended emoji via direct email but I see the garbled
> characters in the web interface:
>
> $B!3(B($B!1'U!1(B;)$B%N(B - worried
>
>
>
> $B!4(B($B!w!,"&!,!w!K%N(B - happy
>
> $B!3(B(#`$B'%!-(B)$B%N(B - angry
>
> $B!Z!&(B_$B!&(B?$B![(B- confused
>
>
> I believe there is an encoding issue somewhere in the
> unicode.org/mail-arch toolchain.
>
>
>

_______________________________________________
Unicode mailing list
Unicode_at_unicode.org
http://unicode.org/mailman/listinfo/unicode
Received on Fri Apr 04 2014 - 00:46:31 CDT
