Re: charset parameter in Google Groups

From: Asmus Freytag (
Date: Thu Jul 01 2010 - 16:49:07 CDT

    On 7/1/2010 11:29 AM, John Burger wrote:
    > Andreas Prilop wrote:
    >>> The problem with slavishly following the charset parameter is
    >>> that it is often incorrect.
    >> I wonder how you could draw such a conclusion. In order to make
    >> such a statement, there must be some other (god-given?) parameter,
    >> which is the "real charset".
    > If you have never encountered a web page in which the charset
    > parameter encoded in the page (or in the HTTP headers) did not
    > accurately reflect the "real charset", as indicated by the actual data
    > in the page, then your experience differs sharply from mine, and from
    > everyone else I have ever met.
    Let's unravel this.

    First, there are qualitative vs. quantitative arguments. Yes, mis-tagging
    occurs (for all the reasons Shawn gave in his reply). But Andreas' point
    was that for languages needing more than ASCII, there's a nice
    corrective. If many (most) viewers now base their display on charset,
    then more documents would be expected to be correctly tagged for those
    types of text, because they tend to degrade dramatically otherwise and
    users (authors) would take action to correct the situation. The example
    of this is reading a text as 8859-1 when it is 8859-2 (Eastern European).
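
    To make the "degrades dramatically" case concrete, here is a minimal
    sketch in Python. The helper name misread and the sample word are
    illustrative assumptions, not anything from the thread; the codec
    names are the ones Python's standard library accepts.

```python
def misread(text: str, true_charset: str, assumed_charset: str) -> str:
    """Encode text in its true charset, then decode it under a wrong tag."""
    return text.encode(true_charset).decode(assumed_charset)


# A Polish word made up entirely of non-ASCII letters (z-dot, o-acute,
# l-stroke, c-acute), written with escapes to keep the source ASCII-only.
original = "\u017c\u00f3\u0142\u0107"

# The bytes are really ISO 8859-2, but the viewer trusts an 8859-1 tag.
garbled = misread(original, "iso-8859-2", "iso-8859-1")
```

    Only the o-acute survives, because it sits at the same code point in
    both charsets; the other three letters turn into unrelated symbols,
    which is the kind of damage an author is likely to notice and fix.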

    This is different from the issue of selecting the correct
    charset, if it only affects some special symbols (copyright, punctuation
    marks, the euro sign). In these cases, the text degrades in much more
    subtle ways, and usually remains readable. I would expect that the
    incidence of mis-tagging in such a situation is larger. The example for
    this is reading a text as 8859-1 when it was 1252 (Windows code page
    with extra characters not in ISO 8859-1 - Shawn mentioned this case as

    If I were to design a charset-verifier, I would distinguish between
    these two cases. If something came tagged with a region-specific
    charset, I would honor that, unless I found strong evidence of the "this
    can't be right" nature. In some cases, to collect such evidence would
    require significant statistics. The rule here should be "do no harm",
    that is, destroying a document by incorrectly changing a true charset
    should receive a much higher penalty than failing to detect a broken
    charset. (That way, you don't penalize people who live by the rules :).

    When it comes to a document tagged with 8859-1, I might relax this
    slightly, as that tag is one of the common default tags and is more
    likely to have been applied blindly.
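
    One way to read the "do no harm" rule is as an expected-cost
    comparison: override the declared charset only when the detector's
    confidence exceeds a threshold derived from the two penalties. The
    sketch below assumes a detector (not shown) that supplies a
    confidence between 0 and 1; the function names and the penalty
    values are hypothetical and chosen only for illustration.

```python
def override_threshold(penalty_wrong_override: float,
                       penalty_missed_fix: float) -> float:
    """Confidence above which overriding has the lower expected cost.

    With p = probability that the declared tag is wrong:
      expected cost of overriding = (1 - p) * penalty_wrong_override
      expected cost of keeping    = p * penalty_missed_fix
    Overriding wins when p > wrong / (wrong + missed).
    """
    return penalty_wrong_override / (penalty_wrong_override + penalty_missed_fix)


# Hypothetical penalties: destroying a correctly tagged document is
# treated as far worse than leaving a mis-tagged one unfixed.
STRICT = override_threshold(20.0, 1.0)   # region-specific tags
RELAXED = override_threshold(4.0, 1.0)   # common default tags

COMMON_DEFAULT_TAGS = {"iso-8859-1"}


def resolve_charset(declared: str, detected: str, confidence: float) -> str:
    """Keep the declared charset unless the evidence clears the bar."""
    declared = declared.lower()
    detected = detected.lower()
    threshold = RELAXED if declared in COMMON_DEFAULT_TAGS else STRICT
    if detected != declared and confidence > threshold:
        return detected
    return declared
```

    With these numbers, a detector that is 90% sure would not override
    an 8859-2 tag, but would override an 8859-1 tag, since the latter is
    more likely to have been applied blindly.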

    When it comes to deciding whether something is Windows code page or a
    true ISO charset, the bar can be set lower - one is usually a superset
    of the other, and detecting any characters in the superset should
    trigger a reassignment. Unlike the other case, the "penalties" for
    getting this wrong are much less severe.
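
    For the 8859-1 vs. 1252 pair, the superset check can be sketched as
    a scan for bytes in the 0x80-0x9F range, which are C1 control codes
    in ISO 8859-1 but mostly printable characters (euro sign, curly
    quotes, dashes) in Windows-1252. The function name and the
    single-byte trigger are assumptions made for illustration; five
    bytes in that range are unassigned in 1252, so the sketch leaves the
    tag alone if any of them appear.

```python
# Bytes in 0x80-0x9F that Windows-1252 leaves unassigned.
CP1252_UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})

LATIN1_NAMES = {"iso-8859-1", "latin-1", "latin1"}


def refine_latin1_tag(data: bytes, declared: str) -> str:
    """Return "windows-1252" if 8859-1-tagged data uses 1252-only bytes."""
    if declared.lower() not in LATIN1_NAMES:
        return declared  # only the 8859-1 / 1252 pair is handled here
    extras = {b for b in data if 0x80 <= b <= 0x9F}
    if extras and not (extras & CP1252_UNDEFINED):
        return "windows-1252"
    return declared


# Text with a euro sign and curly quotes, saved as Windows-1252 but
# (hypothetically) served with an ISO-8859-1 tag.
sample = "price: \u20ac5 \u201cok\u201d".encode("cp1252")
charset = refine_latin1_tag(sample, "ISO-8859-1")
text = sample.decode(charset)
```

    A wrong guess here costs little: outside that byte range the two
    charsets agree, so ordinary accented letters decode the same way
    under either tag.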


    This archive was generated by hypermail 2.1.5 : Thu Jul 01 2010 - 16:52:06 CDT