From: Mark Davis (email@example.com)
Date: Fri May 12 2006 - 15:42:27 CDT
In an ideal world, web pages would be reliably tagged with the correct
charset. (In an 'idealer' world, all pages would be in well-formed UTF-8.)
Of course, we are hardly in an ideal world, and programs dealing with web
pages are faced with a morass of bad data:
- pages where the server charset doesn't match the page charset (often
the server stomping on the page's correct charset)
- pages where the page charset doesn't match the actual bytes (often
the tools stamp everything with ISO-8859-1, so that label is particularly
- pages where the bytes are ill-formed for the charset
- pages which are a mixture of different charsets (often resulting
from programmatic composition, e.g. inserting an ad in a different charset
than the body of the page).
- and so on.
In a somewhat less ideal world, all of the browsers, crawlers, and spiders
(and various other animals) would have always consistently rejected such
problem pages. But we're hardly in that world either. If one system can
interpret malformed pages in some kind of "reasonable" way, then there is
pressure on the other page consumers to follow suit. So as a practical
matter, I suspect we will forever see programs applying heuristics to
provide a 'best fit' interpretation of web page content.
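
To make the 'best fit' idea concrete, here is a minimal sketch in Python of one such heuristic. It is an illustration only, not how IE or any other browser actually behaves: the function name decode_page and the order of the checks are assumptions made for this example. It tries strict UTF-8 first (ill-formed sequences reject the guess rather than being interpreted), then the declared charset, and finally ISO-8859-1, which maps every byte and so cannot fail.

```python
from typing import Optional, Tuple


def decode_page(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, charset_used) for a page body.

    Hypothetical heuristic for illustration:
      1. strict UTF-8 (any ill-formed sequence rejects this guess),
      2. the declared charset, if any and if the bytes are valid in it,
      3. ISO-8859-1 as a last resort, since every byte is valid there.
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    for name in candidates:
        try:
            return raw.decode(name, errors="strict"), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are ill-formed for this charset.
            # LookupError: the label names a charset Python does not know.
            continue
    return raw.decode("iso-8859-1"), "iso-8859-1"


def decode_utf8_liberal(raw: bytes) -> str:
    """'Liberal' UTF-8: ill-formed sequences become U+FFFD instead of failing."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    latin1_bytes = "café".encode("iso-8859-1")   # b'caf\xe9', not valid UTF-8
    print(decode_page(latin1_bytes))             # falls back to ISO-8859-1
    print(decode_page("café".encode("utf-8"), declared="iso-8859-1"))
```

Trying strict UTF-8 before trusting the label reflects the point above that an ISO-8859-1 stamp is often applied blindly; real ISO-8859-1 text with accented letters is rarely well-formed UTF-8, so the strict check seldom produces a false positive. The liberal variant shows the other policy choice, where bad sequences are replaced rather than causing the UTF-8 guess to be abandoned.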
On 5/12/06, Keutgen, Walter <firstname.lastname@example.org> wrote:
> your excellent conclusion leaves aside the autodetection.
> Philippe Verdi wrote:
> > Doesn't it break or severely limit the encoding autodetection in IE?
> > This may explain why IE so often displays Chinese characters in the
> > of a French webpage hosted on a server that simply does not specify its
> > actual encoding: IE returns a false positive match with UTF-8, instead
> > of identifying the ISO-8859-1 encoding that was actually used.
> > This is a severe and very annoying bug for users (like French users
> > to read webpages that were encoded as ISO-8859-1 but interpreted by
> > as UTF-8 as if it was Chinese, even though it would be invalid UTF-8).
> Microsoft should leave the ill formed UTF-8 sequences aside for the
> determination of the coded character set.
> Or alternatively, would it not be simpler to stick to the standards and
> choose ISO-8859-1 when the HTML source does not provide any charset. More
> philosophically, is it really better to try making it better than the
> The reader can still correct by choosing the appropriate encoding. Then
> Microsoft could satisfy everybody by offering 'UTF-8 strict' and 'UTF-8
> liberal' or better, if the UTF-8 stream contains ill formed sequences,
> offering the user to accept them by a pop-up dialogue.
> Best regards
> Walter Keutgen
> Unisys Belgium
> -----Original Message-----
> From: email@example.com [mailto:firstname.lastname@example.org] On
> Behalf Of Doug Ewell
> Sent: 12 May 2006 17:45
> To: Unicode Mailing List
> Subject: Re: Win IE 7b2 and UTF-8
> Through the years, Microsoft and especially IE have taken a great deal
> of criticism for being either too liberal or too conservative (or both)
> in what they accept. Whichever they choose, there is sure to be someone
> waiting in the wings to lambast them for it.
> IMHO, what Microsoft should do with regard to decoding invalid UTF-8
> sequences is make a decision, one way or the other, and document that
> decision openly. That way the debate, and there is sure to be one, will
> have to focus on the policy and not whether the software is "buggy."
> My personal preference (RFC 793 notwithstanding) would be for IE to
> decline to interpret invalid UTF-8, since that is the more secure
> approach. As Philippe himself pointed out, there's probably not much of
> this type of data out there. But it is their call.
> Doug Ewell
> Fullerton, California, USA
This archive was generated by hypermail 2.1.5 : Fri May 12 2006 - 15:51:39 CDT