Re: Strange Behavior by Win IE 6 displaying bad UTF-8

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Apr 23 2006 - 10:49:27 CST

    Tom Gewecke wrote on Sunday, April 23, 2006 at 3:15 PM

    > Recently I tried to view a web site in UTF-8 polytonic Greek, but it came
    > up with a lot of question marks on all my Mac OS X browsers. So I had a
    > close look at the source and sure enough, the UTF-8 bytes for the Greek
    > Extended chars seemed to be wrong. When I asked the site author, he
    > said everything displayed fine in Win IE. So I tried that and, sure
    > enough, it did, despite the UTF-8 bytes appearing to be wrong.
    >
    > Can anyone tell me how this can happen? (Note: My version of Win IE 6
    > is somewhat old, from 2001).

    It still displays correctly in my copy of IE 6, and I am up to date with
    the Windows XP security patches. Firefox also detects the error, so the
    page fails only in browsers that check UTF-8 for validity.

    The bad byte sequence (as stated on the test page) is E1 FC D0, instead of
    E1 BC 90.

    It's actually very simple. Given an initial byte E1, the next two bytes
    must be of the form 10xxxxxx 10xxxxxx. If the parser trusts alleged UTF-8
    to be valid UTF-8 (which it should not), it can ignore the non-x bits and
    use only the low six bits of each trailing byte. Here it is the second and
    third bytes that are incorrect, being FC and D0 rather than BC and 90,
    i.e. bit 6 (hex 40) is 1 whereas it must be 0. The low six bits of FC
    (wrong) and BC (correct) are the same, as are those of D0 (wrong) and 90
    (correct), so such a parser still arrives at the intended character.
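
    To make the bit arithmetic concrete, here is a minimal Python sketch of
    the two behaviours. The function names lenient_decode3 and strict_decode3
    are made up for illustration; this is not IE's actual code, only a guess
    at the kind of masking that would produce the observed result. The strict
    version checks only the lead and trailing byte patterns, not the further
    constraints (overlong forms, surrogates) that a full validator needs.

```python
def lenient_decode3(b0, b1, b2):
    """Decode a three-byte sequence by masking, without validating anything.

    Hypothetical model of a parser that trusts its input: the top two bits
    of each trailing byte are simply thrown away.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def strict_decode3(b0, b1, b2):
    """Same arithmetic, but reject bytes that do not match 1110xxxx 10xxxxxx 10xxxxxx."""
    if b0 & 0xF0 != 0xE0:
        raise ValueError(f"{b0:02X} is not a three-byte lead byte")
    for b in (b1, b2):
        if b & 0xC0 != 0x80:
            raise ValueError(f"{b:02X} is not of the form 10xxxxxx")
    return lenient_decode3(b0, b1, b2)


bad = (0xE1, 0xFC, 0xD0)   # bytes found on the page
good = (0xE1, 0xBC, 0x90)  # bytes that should have been there

# Both byte sequences mask down to the same code point.
print(hex(lenient_decode3(*bad)), hex(lenient_decode3(*good)))

# The strict decoder accepts the good sequence and rejects the bad one.
print(hex(strict_decode3(*good)))
try:
    strict_decode3(*bad)
except ValueError as err:
    print("rejected:", err)
```

    Both sequences mask down to U+1F10, which is why a trusting parser shows
    the intended Greek letter while a validating one shows an error.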

    Richard.


