Re: Frequent incorrect guesses by the charset autodetection in IE7

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 13 2006 - 13:52:14 CDT

    From: "James Kass" <jameskass@att.net>
    > Philippe Verdy wrote,
    >
    >> An example of website that becomes horrible because of that, or that exhibits
    >> runtime errors in javascripts due to incorrect selection in a page that is
    >> clearly French and with enough text content to confirm this, including the
    >> domain name, and where no CJK charset should be autodetected:
    >> http://www.croix-rouge.fr/
    >> (this is the official web site for the French delegation of the Red Cross).
    >
    > The French Red Cross page has no character set declaration. (Well, it has one,
    > but it is left blank/empty.)

    I know all that, and that's just a common example.

    > The very first character in the HTML file which
    > is not mark-up is &#10010; (U+271A, HEAVY GREEK CROSS). But, that's an NCR
    > which is, of course, in ASCII and shouldn't affect any heuristics regarding
    > character sets.
    >
    > Accented characters called with named HTML references (like &eacute;) display
    > just fine on this page while non-ASCII material seems to display as CJK ideographs.
    >
    > Interestingly, setting the character set to auto-detect in MSIE 6 results in
    > correct display. (I normally operate with auto-detect disabled.)

    That's also what I get with IE6, but apparently this no longer works in IE7, which selects JIS or SJIS.
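
    The quoted observation (character references display fine while raw non-ASCII bytes turn into something else) can be reproduced outside a browser. The following is only a sketch in Python, not a model of what IE7 actually does: the sample string is made up, and Python's `shift_jis` and `euc_jp` codecs merely stand in for whatever decoders the browser uses.

```python
import html

# Made-up sample: two references written in pure ASCII, followed by a word
# whose accented letters are stored as raw ISO-8859-1 bytes (0xE9).
SAMPLE = "fran&ccedil;aise &#10010; d\u00e9l\u00e9gation"
RAW = SAMPLE.encode("iso-8859-1")

def render(raw: bytes, charset: str) -> str:
    """Decode the bytes with the given charset, then expand HTML references."""
    return html.unescape(raw.decode(charset, errors="replace"))

for charset in ("iso-8859-1", "shift_jis", "euc_jp"):
    # The ASCII-only references expand identically every time; only the
    # raw 0xE9 bytes depend on the charset chosen for decoding.
    print(charset, "->", render(RAW, charset))
```

    The references come out the same under all three charsets because they consist only of ASCII bytes; only the word containing raw 8-bit bytes changes.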

    > In the absence of a character set declaration in the HTML, why shouldn't
    > a modern browser default to UTF-8? Unicode is the universal character
    > set and UTF-8 its most popular character set in web pages.

    I never said that UTF-8 was selected or that it should be selected (I spoke about JIS and SJIS, though there are some rare cases where EUC is selected too); that is not the problem here. What I really can't understand is why IE7 selected JIS without any reasonable evidence.

    I also don't understand why IE7, when autodetection is active, does not keep the manual selection as additional metadata in its local page cache after a bad guess has occurred.
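
    To make the idea concrete, here is a minimal sketch in Python of such a memory of manual selections. Everything in it is hypothetical (the class name, the method names, the priority order); it does not describe any mechanism that IE actually has.

```python
from urllib.parse import urlsplit

class CharsetOverrides:
    """Hypothetical store of charsets that the user selected by hand."""

    def __init__(self):
        self._by_url = {}   # exact page URL -> charset
        self._by_host = {}  # host name -> last charset chosen on that host

    def remember(self, url: str, charset: str) -> None:
        """Record a manual selection for this page and for its host."""
        self._by_url[url] = charset
        self._by_host[urlsplit(url).hostname] = charset

    def resolve(self, url, declared=None, guessed=None, default="iso-8859-1"):
        """Pick a charset. Assumed priority: explicit declaration, manual
        choice for the page, manual choice for the host, guess, default."""
        if declared:
            return declared
        if url in self._by_url:
            return self._by_url[url]
        host = urlsplit(url).hostname
        if host in self._by_host:
            return self._by_host[host]
        return guessed or default

overrides = CharsetOverrides()
overrides.remember("http://www.croix-rouge.fr/", "iso-8859-1")
# Another (hypothetical) page on the same host: the bad guess is ignored.
print(overrides.resolve("http://www.croix-rouge.fr/page2", guessed="shift_jis"))
```

    Letting an explicit declaration win over the remembered choice is an assumption of this sketch; the point is only that a remembered manual selection should rank above a guess.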

    The autodetection mechanism is definitely broken: it even breaks the HTML code and structure (invalid tags generated, script errors, broken links with incorrect syntax, broken JavaScript), including at the most basic level (HTML tags), and it now even interprets invalid JIS codes, which are displayed as square boxes or question marks.

    >> Having to manually select the correct encoding when navigating a large web site
    >> with many pages is really irritating for users...
    >
    > Which is why I normally operate with auto-select disabled. Choose the
    > character set which you expect to encounter most often, set the browser
    > to that character set, and disable auto-select. Pages correctly labelled
    > and served will display in their correct character sets, pages which aren't
    > will display in your selected default.

    > Sounds like they need a volunteer. Perhaps someone who speaks French and
    > appears to have a little spare time?

    I have reported that to them. In fact the same is true for the web site of the International Committee of the Red Cross (icrc.org) in its French and Spanish pages, even if it occurs less often; there too, there is no charset declaration. Volunteering does not seem to be an easy option: the web site is managed by a paid employee, and there is a strong policy about web site changes at the Red Cross (even the local delegations cannot work on their pages, agendas and photos directly, but only contribute content that will be published later).

    Changing the web site to UTF-8 is not an option: too many changes, and not enough resources to verify and re-encode them, including in ASP pages and database interfaces. Remember that this is not a computing organization; their engineering resources are very limited, and they have very limited budgets to update their web site as part of their communication/advertising costs, since most of their money goes to their field action in health care, assistance, and education programs.

    Changing the web server so that it forces a single charset in the HTTP header is not an option either (they have pages in Arabic in addition to English, French and Spanish, and sometimes in other languages, depending on their international action and emergency programs). The charset needs to be specified page by page, so I think it should be set in some common ASP include at the top of the pages that generates the common menus, but I can't be sure that this include has been applied consistently, or that the structure of the content will not require changing many pages.
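
    As an illustration of "page by page", here is a sketch of what such a common include would have to emit. It is written in Python rather than ASP, and the language-to-charset table is purely hypothetical: I do not know which encodings the site really uses for each language.

```python
# Hypothetical table; the actual encodings of the site's pages are unknown.
PAGE_CHARSETS = {
    "fr": "iso-8859-1",
    "en": "iso-8859-1",
    "es": "iso-8859-1",
    "ar": "windows-1256",
}

def charset_declarations(lang: str) -> tuple[str, str]:
    """Return the HTTP header line and the matching <meta> tag for a page."""
    charset = PAGE_CHARSETS[lang]
    header = f"Content-Type: text/html; charset={charset}"
    meta = f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
    return header, meta

for line in charset_declarations("fr"):
    print(line)
```

    Either declaration alone would be enough to stop a browser from guessing; emitting both from one place keeps them consistent.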

    What I was reporting here is what may have changed in IE7 regarding charset autodetection, so that it now selects multibyte Asian charsets so aggressively instead of ISO-8859-1 as before. There are several hints that are also not used:
    * the past user selection when browsing in the same domain, or when including attached frames and scripts
    * the page cache, to keep the manual selection for pages and use it in preference to "inherited" or guessed charsets
    * the domain name which indicates possible languages
    * the language of the document itself (detection of common digraphs, trigrams or words, and the distribution of letters: a true JIS page that contains a few ideographs dispersed in the middle of a very large majority of basic Latin letters is exceptional, even for Japanese)
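
    The last hint in the list above can be approximated very cheaply, even without any language model. The following Python sketch is only an illustration with arbitrary, untuned thresholds, and it is not a description of how any browser detects charsets: it treats a page as probably single-byte Latin when non-ASCII bytes are rare and mostly isolated between ASCII bytes.

```python
def high_byte_runs(data: bytes) -> list[int]:
    """Return the lengths of consecutive runs of bytes >= 0x80."""
    runs, current = [], 0
    for byte in data:
        if byte >= 0x80:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def looks_single_byte_latin(data: bytes, max_high_ratio: float = 0.10,
                            min_isolated: float = 0.8) -> bool:
    """Heuristic with arbitrary thresholds: few, mostly isolated high bytes."""
    if b"\x1b" in data:
        return False  # ESC sequences may indicate 7-bit JIS (ISO-2022-JP)
    runs = high_byte_runs(data)
    if not runs:
        return True   # empty or pure ASCII: nothing suggests a CJK charset
    high = sum(runs)
    isolated = sum(1 for length in runs if length == 1)
    return (high / len(data) <= max_high_ratio
            and isolated / len(runs) >= min_isolated)

french = ("Bienvenue sur le site de la Croix-Rouge fran\u00e7aise : "
          "aider, soigner, \u00e9duquer.").encode("iso-8859-1")
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8\u3067\u3059".encode("euc_jp")
print(looks_single_byte_latin(french), looks_single_byte_latin(japanese))
```

    A real detector would combine such a test with the other hints (domain, digraph and trigram frequencies, previous user choices) rather than rely on it alone.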

    I hope this is just a bug in IE7 that will be corrected, or something that has not yet been finalized (missing heuristic rules for the autodetection mechanism).


