Re: Strange Behavior by Win IE 6 displaying bad UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Apr 25 2006 - 22:55:57 CST


    Did you perform the test on the (invalid UTF-8) sequence C0 8A? Does IE6 treat it like a line feed 0A (i.e. as in the obsolete RFC definition of UTF-8)? And what about C0 A0 as a space 20...?
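    For comparison, a strict decoder rejects all three of these overlong sequences outright. Here is a quick probe with Python 3's built-in codec (not IE6, of course), just to show what strict behavior looks like:

        for raw in (b"\xC0\x8A", b"\xC0\xA0", b"\xC0\x80"):
            try:
                print(raw.hex(), "->", repr(raw.decode("utf-8")))
            except UnicodeDecodeError as e:
                print(raw.hex(), "-> rejected:", e.reason)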

    I know that C0 80 may be treated like 00 and then discarded, or rejected as invalid. That looks like a serious issue if those pseudo-equivalents are not treated as either completely equivalent throughout the application or completely invalid everywhere, because it opens security risks.

    For me, UTF-8 is the strictest definition, where only the valid sequence is accepted and all others are rejected and make the document invalid. A decoder that accepts a few pseudo-equivalents looks bad, given that it will interact unpredictably with other similar parsers that use different lax rules accepting more or fewer sequences.
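    A minimal sketch of such a strict check in Python (the function name is mine, and the byte ranges follow the modern RFC 3629 table):

        def is_strict_utf8(data: bytes) -> bool:
            """True only under the modern (RFC 3629) definition:
            no overlongs, no surrogates, nothing above U+10FFFF."""
            i, n = 0, len(data)
            while i < n:
                b0 = data[i]
                if b0 <= 0x7F:                      # one byte: ASCII
                    i += 1
                    continue
                if 0xC2 <= b0 <= 0xDF:              # two bytes (C0/C1 always overlong)
                    need, lo, hi = 1, 0x80, 0xBF
                elif b0 == 0xE0:                    # three bytes, overlong guard
                    need, lo, hi = 2, 0xA0, 0xBF
                elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
                    need, lo, hi = 2, 0x80, 0xBF
                elif b0 == 0xED:                    # exclude surrogates D800..DFFF
                    need, lo, hi = 2, 0x80, 0x9F
                elif b0 == 0xF0:                    # four bytes, overlong guard
                    need, lo, hi = 3, 0x90, 0xBF
                elif 0xF1 <= b0 <= 0xF3:
                    need, lo, hi = 3, 0x80, 0xBF
                elif b0 == 0xF4:                    # exclude values above U+10FFFF
                    need, lo, hi = 3, 0x80, 0x8F
                else:
                    return False                    # C0, C1, F5..FF never lead
                if i + need >= n or not lo <= data[i + 1] <= hi:
                    return False
                if any(not 0x80 <= data[j] <= 0xBF
                       for j in range(i + 2, i + 1 + need)):
                    return False
                i += 1 + need
            return True

        # The four sequences from Tom's test: only the first is accepted.
        for s in (b"\xE1\xBC\x90", b"\xE1\xFC\xD0", b"\xE1\xFC\x90", b"\xE1\xBC\xD0"):
            print(s.hex(), is_strict_utf8(s))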

    If Windows wants to support a lax decoder, it should document and implement a distinct charset with a distinct label ("OBSOLETE-UTF-8", for example) and implement it so that it honors ALL possible pseudo-equivalents, not just some. Then the software can correctly report to other applications whether the document is UTF-8 or OBSOLETE-UTF-8, but it should NEVER indicate that the document was parsed successfully as UTF-8 if the document did not completely adhere to the standard. It should NOT attempt to correct the encoding, and it SHOULD indicate to other applications that, even if the document could be decoded successfully, it was not originally using the UTF-8 standard.
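    To make that concrete, a document handled this way could be labeled explicitly, e.g. in an HTTP header (the charset name here is only the hypothetical one proposed above):

        Content-Type: text/html; charset=OBSOLETE-UTF-8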

    In addition, a document that an application passes to IE with the indication that it should be decoded as UTF-8 should never be decoded in the lax mode. Third-party applications could retry by attempting the lax mode with a distinct charset specifier (not "UTF-8"), so that IE will know that the third-party application is prepared to handle nonconforming documents using this alternate charset.

    There are cases where a document is passed to IE components for encoding validation (or detection), and if IE accepts the document even though it is not conforming, then it is lying to other applications, which may make the wrong decision about how to handle the document.

    This lax mode looks like a security hole (an open door, a breach) that may be exploited by malware trying to defeat, for example, content filters or code checkers (parsers that try to evaluate the safety of some active JavaScript functions, for example).
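    As a sketch of that kind of bypass (the filter and payload are invented for illustration; this is not IE code):

        # A naive filter scans raw bytes for '<' (0x3C) before the
        # document is decoded; the overlong two-byte form of the same
        # character (C0 BC) contains no 0x3C byte and slips through.
        def naive_filter_rejects(raw: bytes) -> bool:
            return 0x3C in raw                  # looks for a literal '<' byte only

        payload = b"\xC0\xBCscript\xC0\xBE"     # overlong '<' ... overlong '>'
        print(naive_filter_rejects(payload))    # False: no 0x3C byte present
        # A lax decoder that accepts overlongs would render the payload as
        # "<script>", exactly the construct the filter tried to screen out.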

    So I hope that those invalid pseudo-equivalents do not work (are not accepted), at least for the ASCII range. But what about the handling of the newline C1 control (NEL), which is part of the ISO-8859-1 range? Normally it is encoded as two bytes, but what you have found indicates that IE would also accept a second, invalid two-byte sequence (one using a lead byte where a trailing byte is expected).
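    For reference, NEL is U+0085, whose only valid UTF-8 form is C2 85. A quick probe with a strict decoder (Python 3 here, not IE) shows that form accepted, while a malformed sequence from Tom's test, which puts a lead-style byte where a trailing byte (80..BF) is required, is rejected:

        print(repr(b"\xC2\x85".decode("utf-8")))   # '\x85' (NEL), the valid form
        try:
            b"\xE1\xFC\x90".decode("utf-8")        # FC cannot be a trailing byte
        except UnicodeDecodeError as e:
            print("rejected:", e.reason)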

    ----- Original Message -----
    From: "Tom Gewecke" <tom@bluesky.org>
    To: <unicode@unicode.org>
    Sent: Monday, April 24, 2006 10:17 PM
    Subject: Strange Behavior by Win IE 6 displaying bad UTF-8

    > From my experiments, and based on Richard Wordingham's explanation, it
    > appears that Windows IE 6 considers the following 4 UTF-8 sequences
    > completely equivalent as far as display is concerned: E1 BC 90, E1 FC
    > D0, E1 FC 90, E1 BC D0. Only the first is valid UTF-8, but they all
    > produce the same character. Presumably there are many other similar sets
    > like this.


