From: Richard Wordingham (email@example.com)
Date: Sun Apr 23 2006 - 10:49:27 CST
Tom Gewecke wrote on Sunday, April 23, 2006 at 3:15 PM
> Recently I tried to view a web site in UTF-8 polytonic Greek, but it came
> up with a lot of question marks on all my Mac OS X browsers. So I had a
> close look at the source and sure enough, the UTF-8 bytes for the Greek
> Extended chars seemed to be wrong. When I asked the site author, he
> said everything displayed fine in Win IE. So I tried that and, sure
> enough, it did, despite the UTF-8 bytes appearing to be wrong.
> Can anyone tell me how this can happen? (Note: My version of Win IE 6
> is somewhat old, from 2001).
It still displays with mine, and I am up-to-date on the Windows XP security
patches. Firefox also detects the error, so it is only a problem with
browsers that check UTF-8 for validity.
The bad byte sequence (as stated on the test page) is E1 FC D0, instead of
E1 BC 90.
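As a rough illustration of what a validity-checking browser does with these two sequences, here is a minimal Python sketch. It uses Python's built-in strict UTF-8 decoder as a stand-in for the browser's check; the helper name is_valid_utf8 is mine and is not taken from any browser.

```python
# Byte sequences quoted on the test page.
bad = bytes([0xE1, 0xFC, 0xD0])   # as served by the site
good = bytes([0xE1, 0xBC, 0x90])  # what it should have been


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the data.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")  # the default error handler is "strict"
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(good))  # the well-formed sequence is accepted
print(is_valid_utf8(bad))   # FC is not a trail byte, so this is rejected
```

A decoder that rejects the sequence typically substitutes a replacement character or a question mark, which would match what Tom saw.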
It's actually very simple. Given an initial byte E1, the next two bytes
must be of the form 10xxxxxx 10xxxxxx. If the parser trusts alleged
UTF-8 to be valid UTF-8 (which it should not), it can ignore the non-x
bits. Now, it is the second and third bytes that are incorrect, being FC
and D0 rather than BC and 90, i.e. bit 6 is 1 whereas it must be 0. The low
six bits of FC (wrong) and BC (correct) are the same (111100), as are those
of D0 (wrong) and 90 (correct) (010000), so a parser that ignores the top two
bits decodes both sequences to the same character.
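The masking argument can be sketched in Python as follows. This is only an assumption about how a trusting decoder might behave, not IE's actual code; the function names lenient_decode3 and trail_bytes_ok are made up for the illustration, and the check covers only the 10xxxxxx form of the trail bytes, not every UTF-8 validity rule.

```python
def lenient_decode3(b0: int, b1: int, b2: int) -> int:
    """Decode a three-byte sequence while trusting it to be well formed.

    Only the payload (x) bits are kept: the low 4 bits of the lead byte
    and the low 6 bits of each following byte. The top bits are ignored.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def trail_bytes_ok(b1: int, b2: int) -> bool:
    """Check that both following bytes have the form 10xxxxxx."""
    return (b1 & 0xC0) == 0x80 and (b2 & 0xC0) == 0x80


# The bad and the correct sequence differ only in bit 6 of bytes 2 and 3,
# so the lenient decoder gives the same code point for both.
print(hex(lenient_decode3(0xE1, 0xFC, 0xD0)))  # 0x1f10
print(hex(lenient_decode3(0xE1, 0xBC, 0x90)))  # 0x1f10

# A validating parser notices the difference.
print(trail_bytes_ok(0xFC, 0xD0))  # False
print(trail_bytes_ok(0xBC, 0x90))  # True
```

U+1F10 is in the Greek Extended block, which fits the polytonic Greek text on the page.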