Re: charset parameter in Google Groups

From: Philippe Verdy (
Date: Mon Jul 12 2010 - 16:03:51 CDT

    The problem in this message is probably not the specified charset
    (windows-1252) but the way the MIME type is written just before it:
    "TEXT/PLAIN".

    Traditionally, MIME types are given in lowercase only, so if you
    had written "text/plain; charset=windows-1252", it would have been
    correctly detected. Google must have thought that the MIME header tag
    was unknown and ignored it completely, trying to guess the charset
    instead. But given that there's very little text in the message, the
    guessing algorithm is more likely to fail.
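
    As a minimal sketch of case-insensitive handling (not taken from the
    original message, and saying nothing about what Google actually
    runs), Python's standard email parser normalizes the media type to
    lowercase before returning it. The RAW_MESSAGE text below is a
    made-up example.

```python
from email import message_from_string

# Hypothetical message using the uppercase media type discussed above.
RAW_MESSAGE = (
    "Content-Type: TEXT/PLAIN; charset=windows-1252\n"
    "\n"
    "body\n"
)

msg = message_from_string(RAW_MESSAGE)

# get_content_type() returns the type/subtype coerced to lowercase, so a
# comparison against "text/plain" succeeds whatever case the sender used.
content_type = msg.get_content_type()

# get_content_charset() returns the charset parameter, also lowercased.
charset = msg.get_content_charset()

print(content_type)
print(charset)
```

    A reader that compared the raw header value against "text/plain"
    without this normalization would fall through to charset guessing,
    which is the failure suspected above.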

    One of the servers may also have transcoded the message from
    windows-1252 to UTF-8 but forgotten to change the MIME type. Then the
    Google page just detects the windows-1252 encoding that is still
    present in the MIME header, and displays the message according to it.
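
    The effect of such a mislabeled transcoding can be reproduced in a
    few lines. This is only an illustrative sketch: the sample string is
    hypothetical, and the two steps stand in for whatever relay and
    reader were actually involved.

```python
# Hypothetical original text containing a curly apostrophe (U+2019).
original = "it\u2019s"

# Step 1: a relay transcodes the body to UTF-8 (U+2019 becomes E2 80 99).
utf8_bytes = original.encode("utf-8")

# Step 2: the header still says windows-1252, so the reader decodes the
# UTF-8 bytes with the wrong charset.
displayed = utf8_bytes.decode("windows-1252")

# Undoing the wrong decode recovers the original text.
recovered = displayed.encode("windows-1252").decode("utf-8")

print(displayed)
print(recovered)
```

    Each non-ASCII character turns into a short run of unrelated
    characters, the characteristic pairs and triples mentioned in the
    quoted message below.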

    I don't know which strange email program you use that generates this
    form of MIME type, even though MIME types should be interpreted
    ignoring case.

    And I'm not even sure that Google is the culprit here: it may have
    been caused by a relaying SMTP server not operated by Google but by
    an ISP that transcoded the message without changing the MIME header
    accordingly.

    Or it may have been caused by your own specific SMTP agent before the
    message was even sent to the Internet (and I think that this is
    probably the cause of the problem here, because I seriously doubt that
    a Google SMTP relay, or an SMTP relay used by an ISP, would alter the
    internal encoding of the message in transit, even if that's allowed
    for plain-text contents).

    Note that "windows-1252" is likely to have been transcoded by an SMTP
    server running on Unix/Linux with a broken version, thinking that only
    standard ISO charsets should be admitted, but still ignoring the fact
    that windows-1252 is now becoming the preferred charset over
    ISO-8859-1 (but not over UTF-8).

    Here I see absolutely no relation with the ISO-8859-15 charset (which
    is used by almost nobody, whereas windows-1252 is far superior for
    compatibility, as it fully preserves the ISO-8859-1 charset and just
    maps additional characters within the code area previously reserved
    in ISO-8859-1 for C1 controls, which have never been used in
    plain-text emails). Even HTML5 recognizes this fact: ISO-8859-1 is
    being superseded by windows-1252 for practical reasons.
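
    The relationship between the two charsets can be checked directly.
    The sketch below relies on Python's built-in codecs, assuming that
    their "windows-1252" and "iso-8859-1" tables match the registered
    charsets; the variable names are only illustrative.

```python
# Every byte value outside the C1 control range 0x80-0x9F.
shared = [b for b in range(256) if not 0x80 <= b <= 0x9F]

# Outside that range, both charsets decode each byte to the same character.
same_outside_c1 = all(
    bytes([b]).decode("windows-1252") == bytes([b]).decode("iso-8859-1")
    for b in shared
)

# Inside the range, windows-1252 assigns printable characters where
# ISO-8859-1 has C1 control codes.
euro = b"\x80".decode("windows-1252")        # EURO SIGN, U+20AC
left_quote = b"\x93".decode("windows-1252")  # LEFT DOUBLE QUOTATION MARK, U+201C
c1_control = b"\x93".decode("iso-8859-1")    # C1 control code, U+0093

print(same_outside_c1)
```

    So text that is valid ISO-8859-1 and contains no C1 controls decodes
    identically under either label, and only the 0x80-0x9F block
    differs.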

    And there's no reason to ignore this charset just because its name
    contains the term "windows", given that the registration by Microsoft
    was made in an open way (the trademark citation causes no problem:
    Microsoft does not want us to display a trademark symbol and a notice
    about its owner), and Google is certainly not deciding to
    discard/ignore this widely used charset.


    > Message of 07/07/10 18:04
    > From: "Andreas Prilop" <>
    > To:
    > Cc:
    > Subject: Re: charset parameter in Google Groups
    > On Tue, 6 Jul 2010, John Dlugosz wrote:
    > > I often see <?> glyphs where typesetter chars like curved
    > > apostrophes were supposed to be, or characteristic
    > > UTF-8-as-Latin-1 pairs, in web pages.
    > >
    > > I've seen the charset meta tag overridden with header values
    > > from the server, without regard to what's actually in the file.
    > This means that *your* software (browser) behaves *exactly*
    > in the way I expect for Google, too -- nothing else:
    > Recognize the encoding information (charset) of the document
    > and respect it.
    > If the document has
    > charset=ISO-8859-15
    > then you SHALL apply this charset value.
    > You SHALL NOT look whether the author has a Chinese name,
    > whether the document was published in Japan, etc. etc.
    > Is this clear now?
