From: Philippe Verdy (email@example.com)
Date: Sun May 14 2006 - 22:01:29 CDT
From: "Tom Gewecke" <firstname.lastname@example.org>
> On May 12, 2006, at 8:44 AM, Doug Ewell wrote:
>> As Philippe himself pointed out, there's probably not much of this
>> type of data out there.
> Indeed, the only web site of this type which I have ever come across
> was one with the entire Bible in polytonic Greek, in which all the
> Greek Extended characters had the wrong UTF-8. It was not easy to
> convince the author there was anything wrong, since Win IE displayed it
> correctly. How it could get created in the first place is still a
> mystery -- possibly via a buggy NCR to UTF-8 conversion program.
Well, then offer a liberal UTF-8 mode in IE, but exclude it from the autodetection. Users will have to select it manually to read the text, and the authors will know how to correct their websites and can use this liberal decoder to re-encode their invalid text.
Users will be happy, notably:
* those using something other than IE who are trying to read the content of such a site
* the many users who are horrified to see Chinese characters in the middle of a web page that was created with an ISO 8859 charset, simply because their hosting webserver does not offer a way to specify metadata, or forces the HTTP headers to declare UTF-8 even for HTML pages that were not encoded with it.
For me, no invalid UTF-8 sequence should ever be interpreted liberally in the autodetection mode: if the heuristic says the text is probably UTF-8, then every invalid sequence should be replaced by U+FFFD and rendered accordingly, with the missing-glyph symbol (rectangle, question mark...) used for this character in most fonts.
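As a minimal sketch of that strict replacement policy (not IE's actual code, just an illustration), Python's built-in UTF-8 decoder behaves this way when asked to replace errors: each invalid byte sequence becomes U+FFFD instead of being reinterpreted under some legacy charset. The byte string below mixes valid polytonic Greek with an overlong (invalid) sequence:

```python
# Valid UTF-8 for U+1F00 (polytonic Greek alpha with psili),
# followed by b"\xc0\xaf", an overlong encoding that a strict
# decoder must reject as invalid.
data = b"abc \xe1\xbc\x80 \xc0\xaf def"

# errors="replace" substitutes U+FFFD for each invalid sequence,
# so the reader sees the missing-glyph symbol instead of mojibake.
text = data.decode("utf-8", errors="replace")
print(text)

# A liberal decoder, by contrast, would be a separate, manually
# selected mode -- never part of the default/autodetected path.
```

The valid Greek character survives intact while the overlong sequence is replaced, which is exactly the behavior argued for above: the page stays readable and the damage is visible rather than silently masked.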
This archive was generated by hypermail 2.1.5 : Sun May 14 2006 - 22:03:21 CDT