Re: charset parameter in Google Groups

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Fri Jul 02 2010 - 12:04:26 CDT

    John Burger wrote:

    > Asmus distinguishes between two kinds of cases: The first is guessing
    > the charset incorrectly in a way that completely degrades the text,
    > e.g. 8859-1 vs. 8859-2. Second is a more subtle kind of mistake, and
    > arguably much less objectionable, e.g., 8859-1 vs. 1252, or the "smart
    > quotes" problem.

    I'd say the key distinction is between protocol-incorrect behavior (like
    ignoring a character encoding specified properly, due to some “heuristics”)
    and error handling. If a document is declared as ISO-8859-1 encoded, it is
    protocol-incorrect to treat it as anything else, if all the octets are
    defined in ISO-8859-1 and allowed in the data format. However, an HTML 4.01
    document declared as ISO-8859-1 encoded and containing, say, octet 80
    (hexadecimal) is by definition malformed. A browser may decide to refuse to
    display it at all (not a good decision in practice) or to perform some error
    correction, like interpreting the data as windows-1252 encoded instead.
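
    To make the distinction concrete, here is a minimal sketch in Python of
    that kind of error correction. The function name and the "recovered" flag
    are illustrative assumptions, not a description of what any particular
    browser does. Data declared as ISO-8859-1 is decoded as declared unless it
    contains octets 80 to 9F (hexadecimal), which HTML 4.01 does not allow;
    only then is windows-1252 tried instead.

```python
from typing import Tuple

# Octets 0x80-0x9F map to C1 control characters in ISO-8859-1,
# which are not allowed in HTML 4.01 documents.
C1_OCTETS = range(0x80, 0xA0)


def decode_declared_latin1(data: bytes) -> Tuple[str, bool]:
    """Decode data declared as ISO-8859-1 (hypothetical helper).

    Returns (text, recovered). `recovered` is True only when the data
    was malformed and windows-1252 was used as error correction.
    """
    if not any(octet in C1_OCTETS for octet in data):
        # Well-formed: honor the declared encoding, no guessing.
        return data.decode("iso-8859-1"), False
    # Malformed: fall back to windows-1252. A few octets (such as 0x81)
    # are undefined there too, so they become U+FFFD.
    return data.decode("windows-1252", errors="replace"), True


if __name__ == "__main__":
    print(decode_declared_latin1(b"caf\xe9"))
    print(decode_declared_latin1(b"\x80 100"))
```

    The returned flag is what lets the caller tell the user that recovery
    took place, rather than silently overriding the declaration.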

    > I like this distinction, and would point out that we can probably
    > quantify this into a continuum,

    No, I think this requires discretion. Incorrect behavior vs. error handling
    (which may vary, though strong arguments may favor one or another approach).

    If you ask me, error recovery should be signalled to the end user, though
    perhaps discretely (pun intended) in cases where it seems “obvious”.

    -- 
    Yucca, http://www.cs.tut.fi/~jkorpela/ 
    

