Re: Frequent incorrect guesses by the charset autodetection in IE7

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 13 2006 - 19:54:42 CDT

    From: "James Kass" <jameskass@att.net>
    > Philippe Verdy wrote,
    >
    >> The autodetection mechanism is definitely broken, as it even breaks the HTML
    >> code and structure (invalid tags generated, script errors, broken links with
    >> incorrect syntax, broken javascripts), including at the most basic level (html
    >> tags); and it now even interprets invalid JIS codes, which are displayed as square
    >> boxes or question marks.
    >
    > The autodetection mechanism may be broken, but it can't really be blamed
    > for breaking the HTML code and structure. Without a character set
    > declaration, the HTML code is already broken. No HTML validator should
    > pass such a page.

    Why is that? The HTML code is correct, except when it is parsed with a multibyte charset. That should not happen, since no such charset is declared, and the heuristic mechanism should also detect the mismatch when it attempts to identify the charset.

    Note that the page does not specify a DTD version, so it should be parsed as legacy HTML 3.2, and without a charset specification an ISO 8859-based charset should be used. Parsing it as ISO 8859 produces no error. Show me a single sentence in the HTML specs that says the charset indication is mandatory! In legacy HTML 3.2, ISO 8859-1 is even a charset whose support is required, as confirmed by the normative DTDs and the normative list of named entities.

    Note that I don't care about this particular site. But there are TONS of web sites made this way, and an autodetection mechanism that fails this often on existing web sites is certainly a severe compatibility issue (and a regression if it breaks in IE7 but not in IE6).

    What I want to warn about is that the autodetection mechanism in IE7 (at least in the latest public beta) is now severely broken: it only works well with Asian charsets, while European charsets have been neglected even though they account for a vast majority of the existing content on the web.

    The autodetection should not break pages using standard UTF-8 or ISO 8859 charsets. There are now simple ways to detect UTF-8, so it can be attempted first, before trying to parse with ISO 8859 rules and then refining the choice within that charset family. Legacy national multibyte charsets for Asia need stricter heuristics, and if these would degrade the detection of ISO 8859-based charsets, then additional heuristic tables must be taken into account. It seems that the IE7 heuristic evaluates ONLY the possibility of a legacy Asian charset, without balancing it against the possibility of a European charset.
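
    To illustrate the "UTF-8 first, then ISO 8859" order suggested above, here is a minimal Python sketch. The function name guess_charset is hypothetical, and the fallback to ISO 8859-1 (rather than another part of the ISO 8859 family) is an assumption made only for illustration; it says nothing about how IE7 is actually implemented.

```python
def guess_charset(data: bytes) -> str:
    """Hypothetical helper: try strict UTF-8 first, then fall back to ISO 8859-1."""
    # A UTF-8 byte order mark is an unambiguous signal.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # Pure 7-bit content is valid in UTF-8 and in every ISO 8859 part.
    if all(byte < 0x80 for byte in data):
        return "us-ascii"
    try:
        # Strict decoding rejects stray lead bytes, missing continuation
        # bytes and overlong forms, which ISO 8859 text with accented
        # letters almost always triggers.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Assumption: default to ISO 8859-1; refining within the
        # ISO 8859 family would be a later step.
        return "iso-8859-1"
    return "utf-8"
```

    The point of the ordering is that well-formed UTF-8 is very unlikely to occur by accident in ISO 8859 text, so the strict test is cheap and has few false positives.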

    When balancing charset evaluations, there are also hints that are being ignored: the domain name (a national ccTLD) and the text content itself. Very frequent words are real hints of the language used, and French is almost never encoded using JIS, except on Japanese servers that embed some French within larger Japanese content. Even if JIS really were a candidate, there is absolutely nothing in the page content to confirm that it contains Japanese (so the JIS detection thresholds are wrong or missing, or there is an implementation bug in the IE7 heuristic evaluation).
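
    The kind of balancing described above could be sketched roughly as follows, in Python. Everything here is an assumption for illustration: the tiny word list, the ccTLD table, the 0.1 threshold, the equal weights and the function names are all hypothetical, and a real detector would need much larger tables.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny hint tables (assumptions, not real data sets).
FRENCH_WORDS = {"le", "la", "les", "des", "et", "est", "une", "pour", "dans", "que"}
CCTLD_HINTS = {"fr": "iso-8859-1", "de": "iso-8859-1", "jp": "shift_jis"}

def cctld_hint(url):
    """Return the charset suggested by the host's ccTLD, or None."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_HINTS.get(tld)

def french_word_ratio(data: bytes) -> float:
    """Share of ASCII-letter words that are very frequent French words."""
    # ISO 8859-1 maps every byte to a character, so this never fails;
    # the ASCII words we look for decode the same way in any ISO 8859 part.
    text = data.decode("iso-8859-1").lower()
    words = re.findall(r"[a-z]+", text)
    if not words:
        return 0.0
    return sum(word in FRENCH_WORDS for word in words) / len(words)

def pick_charset(url, data, candidates):
    """Score each candidate with both hints and return the best one."""
    scores = {charset: 0.0 for charset in candidates}
    hint = cctld_hint(url)
    if hint in scores:
        scores[hint] += 1.0
    # Assumed threshold: 10% frequent French words counts as a language hint.
    if "iso-8859-1" in scores and french_word_ratio(data) > 0.1:
        scores["iso-8859-1"] += 1.0
    return max(candidates, key=lambda charset: scores[charset])
```

    With such hints, a French page served from a .fr domain would need strong positive evidence of Japanese before a JIS-family charset could win.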

    But more importantly, IE7 does not remember the corrected charset that the user selected after a false guess occurred. Isn't the local web cache used to store similar data (pages, cookies, navigation history, cached form fields, and so on)? Why not store the effective charset that was last selected for a page or domain name, and use it as a strong hint about which charset to use?
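
    Remembering the user's correction per page, with the domain name as a fallback, might look like this minimal in-memory Python sketch. The class name CharsetMemory and its methods are hypothetical, and a real browser would persist this alongside its cache rather than keep it in a dictionary.

```python
from urllib.parse import urlsplit

class CharsetMemory:
    """Hypothetical store for charsets the user selected manually."""

    def __init__(self):
        self._by_page = {}  # exact URL -> charset
        self._by_host = {}  # host name -> charset last selected on that host

    def remember(self, url, charset):
        """Record the charset the user chose for this page and its host."""
        host = urlsplit(url).hostname or ""
        self._by_page[url] = charset
        self._by_host[host] = charset

    def recall(self, url):
        """Return the remembered charset for the page, else for its host, else None."""
        if url in self._by_page:
            return self._by_page[url]
        host = urlsplit(url).hostname or ""
        return self._by_host.get(host)
```

    A detector could consult recall() before running any heuristic and treat a non-None result as a strong hint, so the same false guess is not repeated on every visit.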

    This makes some web sites really unusable. (Also, I don't understand why IE has to requery the server to get the page: selecting a new charset performs a new request, which is a severe problem when interacting with web sites that limit navigation or that perform transactions or change user records during navigation; this clearly does not work with secured web sites.)
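
    Switching charset without a new request only requires keeping the undecoded bytes of the response. A minimal Python sketch, assuming a hypothetical CachedPage class (not anything IE actually exposes):

```python
class CachedPage:
    """Hypothetical cache entry that keeps the raw bytes of a response."""

    def __init__(self, raw: bytes):
        self.raw = raw  # bytes exactly as received from the server

    def text(self, charset: str) -> str:
        """Decode the stored bytes again; no network request is involved."""
        # errors="replace" shows U+FFFD for invalid sequences instead of failing.
        return self.raw.decode(charset, errors="replace")
```

    Because the bytes are decoded again locally, a transaction or form submission is never replayed just because the user picked another charset.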


