Re: Strange Behavior by Win IE 6 displaying bad UTF-8

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Apr 23 2006 - 10:49:27 CST

Next message: Karl Pentzlin: "Pan-Turkic Alphabet of 1926, Latin letter like U+042C/U+044C or U+0184/U+0185"

Previous message: Doug Ewell: "Re: Help in Drafting a Proposal"
In reply to: Tom Gewecke: "Strange Behavior by Win IE 6 displaying bad UTF-8"
Next in thread: Tom Gewecke: "Re: Strange Behavior by Win IE 6 displaying bad UTF-8"
Reply: Tom Gewecke: "Re: Strange Behavior by Win IE 6 displaying bad UTF-8"

Tom Gewecke wrote on Sunday, April 23, 2006 at 3:15 PM

> Recently I tried to view a web site in UTF-8 polytonic Greek, but it came
> up with a lot of question marks on all my Mac OS X browsers. So I had a
> close look at the source and sure enough, the UTF-8 bytes for the Greek
> Extended chars seemed to be wrong. When I asked the site author, he
> said everything displayed fine in Win IE. So I tried that and, sure
> enough, it did, despite the UTF-8 bytes appearing to be wrong.
>
> Can anyone tell me how this can happen? (Note: My version of Win IE 6
> is somewhat old, from 2001).

It still displays with mine, and I am up-to-date on the Windows XP security
patches. Firefox also detects the error, so it is only a problem with
browsers that check UTF-8 for validity.

The bad byte sequence (as stated on the test page) is E1 FC D0, instead of
E1 BC 90.

It's actually very simple. Given an initial byte E1, the next two bytes
must be of the form 10xxxxxx 10xxxxxx. If the parser trusts alleged
UTF-8 to be valid UTF-8 (which it should not), it can simply ignore the
non-x bits. Here it is the second and third bytes that are incorrect, being
FC and D0 rather than BC and 90, i.e. bit 6 (counting from bit 0) is 1
whereas it must be 0. The low six bits of FC (wrong) and BC (correct) are
the same, as are those of D0 (wrong) and 90 (correct), so a parser that
masks off the top two bits decodes both sequences to the same character.
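
As a rough illustration of the point above, here is a minimal Python sketch
of a trusting three-byte decoder next to a checking one. The function names
are hypothetical, and nothing here is taken from IE's actual code; it only
shows that masking off the top two bits of each trailing byte makes
E1 FC D0 and E1 BC 90 come out as the same code point (U+1F10 by this
arithmetic), while a decoder that checks the 10xxxxxx pattern rejects the
bad sequence.

```python
def decode3_lenient(b0, b1, b2):
    """Decode a 3-byte sequence, trusting it to be valid UTF-8.

    Only the payload ("x") bits are used; the marker bits of the
    trailing bytes are masked off without being checked.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_trailing_byte(b):
    """True if b has the form 10xxxxxx."""
    return (b & 0xC0) == 0x80


def decode3_strict(b0, b1, b2):
    """Decode a 3-byte sequence, rejecting ill-formed input."""
    if (b0 & 0xF0) != 0xE0:
        raise ValueError("not a 3-byte lead byte: %02X" % b0)
    if not (is_trailing_byte(b1) and is_trailing_byte(b2)):
        raise ValueError("trailing byte is not of the form 10xxxxxx")
    cp = decode3_lenient(b0, b1, b2)
    # Reject overlong forms and surrogate code points as well.
    if cp < 0x800 or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("overlong form or surrogate: U+%04X" % cp)
    return cp


good = (0xE1, 0xBC, 0x90)
bad = (0xE1, 0xFC, 0xD0)

print("lenient, good: U+%04X" % decode3_lenient(*good))
print("lenient, bad:  U+%04X" % decode3_lenient(*bad))
print("strict,  good: U+%04X" % decode3_strict(*good))
try:
    decode3_strict(*bad)
except ValueError as err:
    print("strict,  bad:  rejected (%s)" % err)
```

Python's built-in decoder behaves like the strict version:
bytes(bad).decode("utf-8") raises UnicodeDecodeError, whereas
bytes(good).decode("utf-8") succeeds.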

Richard.


This archive was generated by hypermail 2.1.5 : Sun Apr 23 2006 - 10:53:33 CST