Frequent incorrect guesses by the charset autodetection in IE7

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 13 2006 - 03:10:36 CDT


    This may be off topic, but how was charset autodetection changed in IE7 so that it now very often guesses wrong on French web sites encoded in ISO-8859-1, detecting SJIS or some other CJK charset instead?
    Many well-known sites are affected. These sites could of course be updated to declare the appropriate charset explicitly, but the problem apparently occurs on web sites whose content is generated and managed with development frameworks and various tools (including server-side includes and scripts).
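
    For reference, declaring the charset explicitly means sending it in the HTTP Content-Type header, or in a meta element near the top of the page. A minimal sketch in Python follows; the function names are made up for illustration, and only the header and meta syntax are standard:

```python
import re


def content_type_header(charset: str = "ISO-8859-1") -> str:
    """Value for the HTTP Content-Type header of an HTML page."""
    return f"text/html; charset={charset}"


def add_meta_charset(html: str, charset: str = "ISO-8859-1") -> str:
    """Insert a meta charset declaration right after the opening <head> tag.

    Hypothetical helper: a real site would set this in its templates.
    """
    meta = f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
    match = re.search(r"<head(\s[^>]*)?>", html, re.IGNORECASE)
    if match is None:
        return html  # no <head> tag found: leave the page untouched
    return html[:match.end()] + meta + html[match.end():]
```

    With either declaration present, a browser has no need to guess, which is why the affected pages are mostly those produced by tools that emit neither.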

    Here is an example of a web site that renders horribly because of this, or that exhibits runtime errors in its JavaScript due to the incorrect charset selection. The page is clearly French, with enough text content to confirm this, as does the domain name, so no CJK charset should be autodetected:
    http://www.croix-rouge.fr/
    (this is the official web site for the French delegation of the Red Cross).

    I could cite many other web sites, but as they are commercial by nature, I don't want to advertise them on this list (that's why I selected this well-known worldwide non-profit humanitarian organization). Commercial organizations will likely adapt their web sites to avoid this error, but non-profit organizations often lack the money and the internal development team to make such corrections, which could be a nightmare for them to handle (and they would do better to invest their volunteers' time and their money in their humanitarian missions).

    I looked through Microsoft's MSDN and Knowledge Base to see what changed in the IE7 autodetection mechanism that causes it to make many more false guesses on French pages, which are most often encoded in ISO-8859-1, but found nothing specific. I suppose that Microsoft has changed some detection thresholds for Asian users who have difficulty determining which Asian charset to select, but how can this affect pages written in a language as common on the web as French?

    It does not seem to occur as much with German, Spanish, or Italian pages (which are also frequently encoded in ISO-8859-1). I don't see any compelling reason for this worsened autodetection, unless the tables of frequent digraphs/trigraphs and frequent words used for the heuristic detection of language+charset tuples were removed from IE7, or these tables have been recreated but not yet tuned, or there is a planned integration with the Windows indexing engine (which, unlike IE, seems to detect the language+charset tuple correctly) through a reusable new API that is not yet implemented and used in the IE7 beta.
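
    To make the digraph idea concrete, here is a toy sketch in Python. It is not IE's algorithm: the digraph list, the two candidate charsets, and the scoring are all assumptions chosen for illustration. It decodes the bytes with each candidate and counts how many adjacent character pairs are common French digraphs:

```python
# Toy illustration only: this digraph list and this scoring are assumptions,
# not the tables or the algorithm used by any browser.
FRENCH_DIGRAPHS = frozenset(
    ["es", "le", "de", "en", "on", "nt", "re", "ou", "qu", "ré", "dé", "té"]
)

CANDIDATES = ("iso-8859-1", "shift_jis")


def digraph_hits(text: str) -> int:
    """Count adjacent character pairs that are common French digraphs."""
    text = text.lower()
    return sum(text[i:i + 2] in FRENCH_DIGRAPHS for i in range(len(text) - 1))


def guess_charset(data: bytes) -> str:
    """Return the candidate charset whose decoding looks most like French."""
    best, best_hits = CANDIDATES[0], -1
    for charset in CANDIDATES:
        try:
            text = data.decode(charset)
        except UnicodeDecodeError:
            continue  # the bytes are not even valid in this charset
        hits = digraph_hits(text)
        if hits > best_hits:
            best, best_hits = charset, hits
    return best


sample = "La Croix-Rouge française est présente dans tous les départements."
print(guess_charset(sample.encode("iso-8859-1")))
```

    A real detector would keep one table per language+charset tuple rather than a single French list, but even this crude count shows why a page with plenty of French text should not be close to the SJIS threshold: reading an accented letter as the lead byte of a double-byte character destroys digraphs that the ISO-8859-1 reading keeps.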

    I have noted that this occasionally affects French pages encoded in UTF-8 as well (they are also detected as SJIS and, more rarely, as ISO-8859-1).
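
    The UTF-8 case is the most surprising one, because UTF-8 can be recognized structurally, without any language table: French text in ISO-8859-1 almost never happens to form valid UTF-8 multi-byte sequences. A minimal sketch of such a check in Python (the function name is hypothetical, and this is not a claim about how IE tests for UTF-8):

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes strictly as UTF-8 and has at least one non-ASCII byte."""
    if not any(byte >= 0x80 for byte in data):
        return False  # pure ASCII: valid UTF-8, but tells us nothing
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True
```

    For example, "français" encoded in UTF-8 passes the check, while the same word encoded in ISO-8859-1 fails it, because the byte for "ç" is not followed by a continuation byte.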

    Having to manually select the correct encoding when navigating a large web site with many pages is really irritating for users. Why doesn't IE consider the encoding selected for the previous page when navigating across pages of the same domain, when the same multiple encodings are possible candidates for the autodetection heuristic?
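
    The kind of per-site memory suggested here could be as simple as the following Python sketch. The class and method names are hypothetical, and it assumes the detector produces a non-empty list of candidate encodings ordered by preference:

```python
from urllib.parse import urlsplit


class EncodingMemory:
    """Remembers the last encoding used for each host (hypothetical helper)."""

    def __init__(self):
        self._last = {}  # host name -> encoding last selected for that host

    def remember(self, url, encoding):
        """Record the encoding that was finally used for a page."""
        host = urlsplit(url).hostname
        if host:
            self._last[host] = encoding

    def choose(self, url, candidates):
        """Prefer the host's previous encoding when it is among the candidates.

        candidates must be a non-empty list, best guess first.
        """
        previous = self._last.get(urlsplit(url).hostname or "")
        if previous in candidates:
            return previous
        return candidates[0]  # otherwise keep the detector's top guess
```

    The previous encoding is used only to break ties among encodings the heuristic already considers plausible, so a site that really does mix encodings is still detected page by page.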


