Re: Win IE 7b2 and UTF-8

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri May 12 2006 - 15:42:27 CDT

    In an ideal world, web pages would be reliably tagged with the correct
    charset. (In an 'idealer' world, all pages would be in well-formed UTF-8.)

    Of course, we are hardly in an ideal world, and programs dealing with web
    pages are faced with a morass of bad data:

       - pages where the server charset doesn't match the page charset (often
       the server stomping on the page's correct charset)
       - pages where the page charset doesn't match the actual bytes (often
       the tools stamp everything with ISO-8859-1, so that label is particularly
       unreliable)
       - pages where the bytes are ill-formed for the charset
       - pages which are a mixture of different charsets (often resulting
       from programmatic composition, e.g. inserting an ad in a different charset
       than the body of the page)
       - and so on.
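
    As a rough illustration of the second and third cases above, here is a
    minimal Python sketch that checks whether a page's bytes are well-formed
    under the charset it was labeled with. The function name
    label_matches_bytes is hypothetical and not taken from any browser. Note
    that ISO-8859-1 assigns a character to every byte value, so this check
    can never catch a wrong ISO-8859-1 label, which is one reason that label
    is so unreliable.

```python
import codecs


def label_matches_bytes(data: bytes, declared: str) -> bool:
    """Return True if `data` decodes without error under `declared`.

    Hypothetical helper for illustration only. A True result means the
    bytes are well-formed for that charset, not that the label is right.
    """
    try:
        codecs.lookup(declared)  # unknown charset labels count as a mismatch
    except LookupError:
        return False
    try:
        data.decode(declared, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# "café" saved as ISO-8859-1 but labeled UTF-8: the lone 0xE9 byte is ill-formed.
french = "café".encode("iso-8859-1")
print(label_matches_bytes(french, "utf-8"))
# Well-formed UTF-8 labeled ISO-8859-1 still "passes", since every byte is valid.
print(label_matches_bytes("café".encode("utf-8"), "iso-8859-1"))
```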

    In a somewhat less ideal world, all of the browsers, crawlers, and spiders
    (and various other animals) would have always consistently rejected such
    problem pages. But we're hardly in that world either. If one system can
    interpret malformed pages in some kind of "reasonable" way, then there is
    pressure on the other page consumers to follow suit. So as a practical
    matter, I suspect we will forever see programs applying heuristics to
    provide a 'best fit' interpretation of web page content.
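
    One possible shape for such a heuristic, sketched in Python under stated
    assumptions: try strict UTF-8 first (ill-formed sequences disqualify it),
    then the declared charset if any, then fall back to ISO-8859-1. The name
    best_fit_decode and the ordering are illustrative assumptions, not a
    description of what IE or any other browser actually does.

```python
from typing import Optional, Tuple


def best_fit_decode(data: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, charset_used) for a page body.

    Illustrative ordering (an assumption, not any browser's algorithm):
    1. strict UTF-8, rejected as soon as one ill-formed sequence is found
    2. the declared charset, if it is known and the bytes are valid for it
    3. ISO-8859-1, which accepts every byte and so always succeeds
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    for charset in candidates:
        try:
            return data.decode(charset, errors="strict"), charset
        except (UnicodeDecodeError, LookupError):
            continue  # ill-formed bytes or unknown label: try the next guess
    return data.decode("iso-8859-1"), "iso-8859-1"


# An unlabeled French page in ISO-8859-1 is not mistaken for UTF-8.
text, used = best_fit_decode(b"caf\xe9")
print(text, used)
```

    Because step 1 is strict, a page containing even one invalid UTF-8
    sequence is never decoded as UTF-8 here; a more liberal variant could
    pass errors="replace" instead, at the cost of more false positives.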

    Mark

    On 5/12/06, Keutgen, Walter <walter.keutgen@be.unisys.com> wrote:
    >
    > Doug,
    >
    > Your excellent conclusion leaves autodetection aside.
    >
    > Philippe Verdi wrote:
    >
    > > Doesn't it break or severely limit the encoding autodetection in IE?
    > > This may explain why IE so often displays Chinese characters in the
    > > middle of a French webpage hosted on a server that simply does not
    > > specify its actual encoding: IE returns a false positive match with
    > > UTF-8, instead of identifying the ISO-8859-1 encoding that was
    > > actually used.
    > >
    > > This is a severe and very annoying bug for users (like French users
    > > trying to read webpages that were encoded as ISO-8859-1 but
    > > interpreted by default as UTF-8 as if it were Chinese, even though
    > > it would be invalid UTF-8).
    >
    > Microsoft should leave ill-formed UTF-8 sequences aside when
    > determining the coded character set.
    >
    > Or alternatively, would it not be simpler to stick to the standards and
    > choose ISO-8859-1 when the HTML source does not provide any charset? More
    > philosophically, is it really better to try to improve on the
    > standards?
    >
    > The reader can still correct this by choosing the appropriate encoding.
    > Then Microsoft could satisfy everybody by offering 'UTF-8 strict' and
    > 'UTF-8 liberal' or, better, if the UTF-8 stream contains ill-formed
    > sequences, letting the user accept them via a pop-up dialogue.
    >
    > Best regards
    >
    > Walter Keutgen
    > Unisys Belgium
    >
    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Doug Ewell
    > Sent: 12 May 2006 17:45
    > To: Unicode Mailing List
    > Subject: Re: Win IE 7b2 and UTF-8
    >
    > Through the years, Microsoft and especially IE have taken a great deal
    > of criticism for being either too liberal or too conservative (or both)
    > in what they accept. Whichever they choose, there is sure to be someone
    > waiting in the wings to lambast them for it.
    >
    > IMHO, what Microsoft should do with regard to decoding invalid UTF-8
    > sequences is make a decision, one way or the other, and document that
    > decision openly. That way the debate, and there is sure to be one, will
    > have to focus on the policy and not whether the software is "buggy."
    >
    > My personal preference (RFC 793 notwithstanding) would be for IE to
    > decline to interpret invalid UTF-8, since that is the more secure
    > approach. As Philippe himself pointed out, there's probably not much of
    > this type of data out there. But it is their call.
    >
    > --
    > Doug Ewell
    > Fullerton, California, USA
    > http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Fri May 12 2006 - 15:51:39 CDT