Re: Strange Behavior by Win IE 6 displaying bad UTF-8

From: Tom Gewecke (tom@bluesky.org)
Date: Sun Apr 23 2006 - 13:09:04 CST

    On Apr 23, 2006, at 9:49 AM, Richard Wordingham wrote:

    >
    > It's actually very simple. Given an initial byte E1, the next two
    > bytes must be of the form 10xxxxxx 10xxxxxx. If the parser then
    > trusts alleged UTF-8 to be valid UTF-8 (which it should not), it can
    > then ignore the non-x bits. Now, it is the second and third bytes
    > that are incorrect, being FC and D0 rather than BC and 90, i.e. bit 6
    > is 1 whereas it must be 0. The low six bits of FC (wrong) and BC
    > (correct) and D0 (wrong) and 90 (correct) are the same.
    >
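
A minimal Python sketch of the masking described above may help. The function names are hypothetical, and this is not claimed to be what Win IE actually does internally; it only shows that a decoder which ignores the non-x bits gets the same code point from E1 FC D0 as from E1 BC 90, while a decoder that checks the 10xxxxxx pattern rejects the bad bytes.

```python
def lenient_decode_3byte(seq: bytes) -> int:
    """Decode three bytes by keeping only the payload (x) bits.

    Hypothetical model of a parser that trusts its input: the marker
    bits of the lead and continuation bytes are never checked.
    """
    lead, second, third = seq
    return ((lead & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)


def strict_decode_3byte(seq: bytes) -> int:
    """Decode three bytes, rejecting anything that is not valid UTF-8."""
    lead, second, third = seq
    if (lead & 0xF0) != 0xE0:
        raise ValueError("lead byte is not of the form 1110xxxx")
    for byte in (second, third):
        if (byte & 0xC0) != 0x80:
            raise ValueError(f"byte {byte:02X} is not of the form 10xxxxxx")
    code_point = lenient_decode_3byte(seq)
    # Reject overlong forms and surrogates, which valid UTF-8 never contains.
    if code_point < 0x800 or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("overlong encoding or surrogate code point")
    return code_point


bad = bytes([0xE1, 0xFC, 0xD0])   # the invalid sequence from the thread
good = bytes([0xE1, 0xBC, 0x90])  # the valid sequence it is read as

print(hex(lenient_decode_3byte(bad)))   # 0x1f10
print(hex(lenient_decode_3byte(good)))  # 0x1f10
print(hex(strict_decode_3byte(good)))   # 0x1f10

try:
    strict_decode_3byte(bad)
except ValueError as err:
    print("rejected:", err)
```

Python's built-in decoder takes the strict route: `bad.decode("utf-8")` raises `UnicodeDecodeError`, while `good.decode("utf-8")` yields U+1F10.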

    Thanks! This would explain some other weird things I have seen in Win
    Outlook, where invalid byte sequences can get displayed as Chinese
    characters.

    Apparently some software in circulation also generates erroneous UTF-8
    like this, and because Win IE displays it as if it were valid, the
    error is pretty hard for a Win IE user to detect.

    Are there any security issues arising from this ability to read
    invalid UTF-8 as if it were valid?


