Re: charset parameter in Google Groups

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 16:03:51 CDT

Next message: Kenneth Whistler: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

Previous message: Michael Everson: "Re: Bengali Script"
Maybe in reply to: Andreas Prilop: "Re: charset parameter in Google Groups"
Next in thread: Mark Crispin: "Re: charset parameter in Google Groups"
Reply: Mark Crispin: "Re: charset parameter in Google Groups"

The problem in this message is probably not in the specified charset
(windows-1252) but in the way the MIME type is specified just before
it: "TEXT/PLAIN".

Traditionally, MIME types are only given in lowercase, so if you
had written "text/plain; charset=windows-1252", it would have been
correctly detected. Google must have thought that the MIME header tag
was unknown and ignored it completely, trying to guess the charset
instead. But given that there's very little text in the message, the
guessing algorithm is more likely to fail.
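
For illustration only, here is a minimal Python sketch of what
case-insensitive handling of such a header looks like. It uses the
standard library "email" package and a made-up sample message; it says
nothing about how Google's software actually parses headers.

```python
from email import message_from_string

# Hypothetical sample message using the uppercase MIME type discussed above.
raw = (
    "Content-Type: TEXT/PLAIN; charset=windows-1252\n"
    "\n"
    "short body\n"
)

msg = message_from_string(raw)

# Both accessors coerce their result to lowercase, so the uppercase
# spelling in the header does not affect the outcome.
content_type = msg.get_content_type()
charset = msg.get_content_charset()

print(content_type, charset)
```

A parser built this way has no reason to fall back to guessing the
charset just because the type was written in uppercase.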

Alternatively, one of the servers may have transcoded the message from
windows-1252 to UTF-8 but forgotten to change the MIME type. The
Google page then just sees the windows-1252 label that is still
present in the MIME header and displays the message according to it.
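
The visible effect of such a mismatch can be reproduced with a short
Python sketch. The helper name "mislabeled_view" is made up for this
example, and the scenario (UTF-8 bytes displayed under a stale
windows-1252 label) is the hypothesis described above, not a confirmed
diagnosis.

```python
def mislabeled_view(text: str) -> str:
    """Show how text looks when its UTF-8 bytes are decoded as windows-1252."""
    utf8_bytes = text.encode("utf-8")
    # The reader trusts the stale "charset=windows-1252" label.
    # Note: a few byte values are unassigned in windows-1252, so some
    # inputs would raise UnicodeDecodeError here instead of garbling.
    return utf8_bytes.decode("windows-1252")


# A curved apostrophe (U+2019) and an e with acute accent (U+00E9).
garbled_apostrophe = mislabeled_view("\u2019")
garbled_e_acute = mislabeled_view("\u00e9")

print(garbled_apostrophe, garbled_e_acute)
```

Each non-ASCII character turns into two or three unrelated characters,
which is the typical signature of this kind of transcoding error.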

I don't know which strange email program you use that generates this
form of MIME type, even though MIME types should be interpreted
ignoring case.

And I'm not even sure that Google is the culprit here: it may have
been caused by a relaying SMTP server not operated by Google but by
an ISP that transcoded the message without updating the MIME header
accordingly.

Or it may have been caused by your own specific SMTP agent before the
message was even sent to the Internet (and I think that this is
probably the cause of the problem here, because I seriously doubt that
a Google SMTP relay, or an SMTP relay used by an ISP, would alter the
message's internal encoding in transit, even if that is allowed for
plain-text contents).

Note that "windows-1252" is likely to have been transcoded by a broken
SMTP server running on Unix/Linux that assumes only standard ISO
charsets should be admitted, ignoring the fact that windows-1252 is
now becoming the preferred charset over ISO-8859-1 (but not over
UTF-8).

Here I see absolutely no relation with the ISO-8859-15 charset (which
is used by almost nobody, whereas windows-1252 is far superior for
compatibility, as it fully preserves the ISO-8859-1 charset and just
maps additional characters into the code area that ISO-8859-1
reserved for C1 controls, which have never been used in plain-text
emails). Even HTML5 recognizes this fact: ISO-8859-1 is being
superseded by windows-1252 for practical reasons.
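
The compatibility claim can be checked against the codec tables that
ship with Python. This is only a sketch: the helper name
"differing_bytes" is made up for this example, and the result reflects
Python's codec tables, which treat a few windows-1252 byte values as
unassigned.

```python
def differing_bytes(codec: str, reference: str = "iso-8859-1") -> list:
    """Return the byte values that `codec` decodes differently from `reference`."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        expected = raw.decode(reference)
        try:
            actual = raw.decode(codec)
        except UnicodeDecodeError:
            actual = None  # byte is unassigned in this codec
        if actual != expected:
            diffs.append(value)
    return diffs


cp1252_diffs = differing_bytes("windows-1252")
latin9_diffs = differing_bytes("iso-8859-15")

print([hex(b) for b in cp1252_diffs])
print([hex(b) for b in latin9_diffs])
```

windows-1252 differs from ISO-8859-1 only in the 0x80-0x9F range (the
C1 control area), while ISO-8859-15 replaces eight printable
characters in the 0xA0-0xFF range.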

And there's no reason to ignore this charset just because it contains
the term "windows", given that the registration made by Microsoft was
made in an open way (the trademark citation causes no problem:
Microsoft does not ask us to display a trademark symbol and a notice
about its owner), and Google is certainly not deciding to
discard/ignore this widely used charset.

Philippe.

> Message of 07/07/10 18:04
> From: "Andreas Prilop" <prilop4321@trashmail.net>
> To: unicode@unicode.org
> Cc:
> Subject: Re: charset parameter in Google Groups
>
>
> On Tue, 6 Jul 2010, John Dlugosz wrote:
>
> > I often see <?> glyphs where typesetter chars like curved
> > apostrophes were supposed to be, or characteristic
> > UTF-8-as-Latin-1 pairs, in web pages.
> >
> > I've seen the charset meta tag overridden with header values
> > from the server, without regard to what's actually in the file.
>
> This means that *your* software (browser) behaves *exactly*
> in the way I expect for Google, too -- nothing else:
>
> Recognize the encoding information (charset) of the document
> and respect it.
>
> If the document has
> charset=ISO-8859-15
> then you SHALL apply this charset value.
>
> You SHALL NOT look whether the author has a Chinese name,
> whether the document was published in Japan, etc. etc.
>
> Is this clear now?

This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 16:06:48 CDT