Re: Win IE 7b2 and UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 11 2006 - 18:37:01 CDT


    From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
    > On Thu, 11 May 2006, Tom Gewecke wrote:
    >
    >> If anyone on the list is running Win IE 7b2, could they let me know whether
    >> it also has IE 6's behavior of displaying bad UTF-8 as if it were correct?
    >> The test page is
    >>
    >> http://homepage.mac.com/thgewecke/badutf8.html
    >
    > Yes it has. (Tested on 7.0.5346.5.)

    I have the same reply.

    Is there any reason for IE6/IE7 to be so "tolerant" of invalid UTF-8? Are there really many processes that produce documents encoded with invalid UTF-8? I don't know of any (not even from Microsoft itself).

    Doesn't this break, or at least severely limit, encoding autodetection in IE? This may explain why IE so often displays Chinese characters in the middle of a French webpage hosted on a server that simply does not specify its actual encoding: IE returns a false positive match with UTF-8, instead of identifying the ISO-8859-1 encoding that was actually used.

    This is a severe and very annoying bug for users (such as French users trying to read webpages that were encoded as ISO-8859-1 but are interpreted by default as UTF-8 and rendered as if they were Chinese, even though the bytes are not valid UTF-8).
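To illustrate the point about autodetection, here is a minimal Python sketch of a strict-first guess. It is not how IE works; the function name `guess_decode` and the two-candidate fallback order are assumptions made for this example only. The idea is simply that a strict UTF-8 decoder rejects typical ISO-8859-1 French text, so a false positive match with UTF-8 is avoided.

```python
def guess_decode(raw):
    """Return (text, encoding_name) for an unlabelled byte stream.

    Hypothetical helper: tries strict UTF-8 first and falls back to
    ISO-8859-1 only when the bytes are not well-formed UTF-8.
    """
    try:
        # errors="strict" raises on any ill-formed sequence
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a codepoint, so this cannot fail
        return raw.decode("iso-8859-1"), "iso-8859-1"


# French text encoded as ISO-8859-1: the byte 0xE9 is followed by an
# ASCII letter, which is not a valid UTF-8 continuation byte.
latin1_page = "D\u00e9j\u00e0 vu \u00e0 l'\u00e9t\u00e9".encode("iso-8859-1")
text, encoding = guess_decode(latin1_page)
```

A short accented text could in rare cases happen to be valid UTF-8 by accident, so this is only a heuristic, but it fails in the safe direction far more often than a lenient decoder does.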

    And even if the server claims that the webpage was encoded with UTF-8, IE should still apply the strict rules, and then either:
    - reject the document (asking the user to select another encoding, or to try to reload it)
    - display "missing glyph" squares for all invalid bytes found in the decoded document, possibly replacing the invalid sequences with a substitution character such as U+FFFD or with codepoints in the PUA ranges. Even better, IE should raise an exception without returning any codepoint: invalid bytes should not be left exposed "as is" to APIs such as Javascript and the XML DOM, which interpret sequences of codepoints.
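The two options listed above can be sketched in Python as follows. The function names and the sample byte string are assumptions made for this illustration; the sample contains `0xC0 0xAF`, an overlong (and therefore invalid) encoding of "/".

```python
def decode_or_reject(raw):
    """Option 1: reject the whole document if it is not valid UTF-8.

    Returns the decoded text, or None so the caller can ask the user
    to pick another encoding or to reload.
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def decode_with_replacement(raw):
    """Option 2: keep the valid parts and substitute U+FFFD for
    every ill-formed sequence, so no invalid byte leaks through."""
    return raw.decode("utf-8", errors="replace")


bad_page = b"caf\xc3\xa9 \xc0\xaf ok"

rejected = decode_or_reject(bad_page)          # None: document refused
repaired = decode_with_replacement(bad_page)   # contains U+FFFD, no "/"
```

In either case the overlong sequence is never interpreted as "/", which is the property that matters for anything downstream that parses the decoded codepoints.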

    These solutions should be applied similarly to unpaired surrogates.
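For the surrogate case, a comparable sketch over UTF-16 code units might look like this. Both helper names are hypothetical, and the input is assumed to be a list of 16-bit integers.

```python
def has_unpaired_surrogate(units):
    """Return True if a list of UTF-16 code units contains a high
    surrogate not followed by a low one, or a stray low surrogate."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low one
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate
            return True
        i += 1
    return False


def decode_utf16_units(units, errors="strict"):
    """Decode code units with Python's UTF-16 codec: errors="strict"
    rejects unpaired surrogates, errors="replace" yields U+FFFD."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le", errors=errors)


lone_high = [0xD83D, 0x0041]   # high surrogate followed by "A"
valid_pair = [0xD83D, 0xDE00]  # a well-formed surrogate pair
```

With `errors="strict"` the lone surrogate raises `UnicodeDecodeError` (rejection); with `errors="replace"` it is substituted, mirroring the two options for invalid UTF-8.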

    Regarding XML or HTML conformance, such a document is also invalid, as it does not decode correctly into a valid sequence of codepoints (according to the W3C standards that describe which characters are valid in a document). Notably, this also means that Javascript code should not run if it contains such invalid byte sequences, unless an encoding successfully decodes the byte stream into a valid sequence of encoded characters according to that encoding and its mapping to Unicode.


