From: Philippe Verdy (firstname.lastname@example.org)
Date: Sun May 14 2006 - 21:46:45 CDT
From: "Doug Ewell" <email@example.com>
> Keutgen, Walter <walter dot keutgen at be dot unisys dot com> wrote:
>> Microsoft should leave the ill formed UTF-8 sequences aside for the
>> determination of the coded character set.
> I agree that if encodings need to be autodetected, allowing invalid
> UTF-8 to be handled as though it were valid UTF-8 hampers that effort.
> It is a shame --but as Mark Davis said, probably a given -- that
> autodetection is necessary at all.
>> Or alternatively, would it not be simpler to stick to the standards
>> and choose ISO-8859-1 when the HTML source does not provide any
> Actually, the code to do what IE does is of about equal complexity to
> the code to interpret UTF-8 strictly. I doubt it had anything to do
> with that.
>> More philosophically, is it really better to try making it better than
>> the standards?
> I *strongly* doubt that Microsoft is trying to reinvent UTF-8. As I
> said, they were probably trying to "be liberal in what they accept," and
> not have people throw eggs at their windows because some badly encoded
> Web page wouldn't display.
But in that case, there's more to lose than to gain by being liberal in accepting such incorrectly encoded UTF-8, which no existing common tool generates (except those written specifically to test this bug), if it breaks the autodetection of MUCH more common charsets. Notably, it causes a French page encoded with ISO-8859-1 to be interpreted as if it were valid UTF-8 when it is not, and thus displayed with things like Han ideographs replacing occurrences of three Latin letters, one of which is an accented ISO-8859-1 letter.
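To make the failure concrete, here is a minimal sketch of a hypothetical lenient decoder. It assumes the bug is that trail bytes are never checked for the 10xxxxxx form; the name lenient_utf8_decode and the exact behavior are illustrative assumptions, not a description of what IE actually does internally.

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Hypothetical 'weak' UTF-8 decoder (an assumption for illustration).

    It trusts the lead byte's announced length and masks the following
    bytes with 0x3F without checking that they are continuation bytes.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 0, b                # ASCII
        elif 0xC0 <= b < 0xE0:
            n, cp = 1, b & 0x1F         # lead byte of a 2-byte sequence
        elif 0xE0 <= b < 0xF0:
            n, cp = 2, b & 0x0F         # lead byte of a 3-byte sequence
        elif 0xF0 <= b < 0xF8:
            n, cp = 3, b & 0x07         # lead byte of a 4-byte sequence
        else:
            n, cp = 0, 0xFFFD           # stray trail byte or 0xF8..0xFF
        trail = data[i + 1:i + 1 + n]
        if len(trail) < n:              # truncated sequence at end of input
            out.append("\ufffd")
            break
        for t in trail:
            cp = (cp << 6) | (t & 0x3F)  # BUG on purpose: no 10xxxxxx check
        out.append(chr(cp) if cp <= 0x10FFFF else "\ufffd")
        i += 1 + n
    return "".join(out)


# The French word "école" as it appears in an ISO-8859-1 page.
page = "école".encode("iso-8859-1")      # b'\xe9cole'
mangled = lenient_utf8_decode(page)      # "éco" collapses into one code point
print(f"U+{ord(mangled[0]):04X}", mangled[1:])
```

The three bytes for "éco" are swallowed as one 3-byte sequence and come out as a single code point in the CJK Unified Ideographs block, while a strict decoder (Python's `page.decode("utf-8")`) raises UnicodeDecodeError on the same input.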
In fact, even a charset autodetector would try the "MS-weak-UTF-8" decoder only as a last resort, after it has failed to detect the language and has also failed to detect the charset. But given that an ISO-8859-1 decoder normally never fails (except if the webpage contains some C0 controls forbidden in HTML), there's absolutely no reason why the MS-weak-UTF-8 decoder would ever be needed or would do better.
In European Latin-based languages, a case where a strict UTF-8 decoder succeeds is also ***extremely rarely*** a case where decoding with a legacy 7/8-bit charset (ISO 8859, ISO 646-based, DOS code pages, Windows code pages, HP-Roman, and even EBCDIC variants) would give the intended text.
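The ordering argued for here can be sketched in a few lines: try a strict UTF-8 decode first, and fall back to ISO-8859-1 only when that fails. The function name detect_and_decode and the choice of ISO-8859-1 as the only fallback are assumptions for illustration; a real autodetector would add language and frequency heuristics.

```python
def detect_and_decode(data: bytes, fallback: str = "iso-8859-1"):
    """Return (charset, text), using strict UTF-8 validity as the signal.

    Hypothetical helper: Python's built-in "utf-8" codec is strict, so any
    ill-formed sequence raises UnicodeDecodeError and triggers the fallback.
    """
    try:
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        # Python's ISO-8859-1 codec maps every byte value, so this cannot fail.
        return fallback, data.decode(fallback)


print(detect_and_decode("école".encode("utf-8")))
print(detect_and_decode("école".encode("iso-8859-1")))
```

Because non-ASCII ISO-8859-1 text almost never happens to form valid UTF-8, the strict check alone separates the two cases in this sketch, and no lenient decoding step is needed.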
As a good charset autodetector is normally built on heuristics that favor the most frequent cases, being liberal within the UTF-8 decoder makes absolutely no sense and just adds to the confusion, because such a bug will be used by people trying to bypass security systems with those alternate MS-weak-UTF-8 encodings.
So this pseudo-liberal implementation creates new (possibly severe) security risks (in security systems that do not expect invalid UTF-8 to be decoded at all, because the standard explicitly says that it is invalid), without solving ANY real problem for users: users will use the UTF-8 encoders of their OSes and applications, and I have still never seen one that generates such invalid data, so actual texts encoded this way are virtually nonexistent (except in forced test cases)!
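As a rough illustration of the security risk, the sketch below uses the well-known overlong form C0 AF for "/". This thread does not say exactly which invalid sequences IE accepts, so the overlong form, the function weak_decode, and the filter looks_safe are all assumptions chosen only to show the general mechanism.

```python
def weak_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: accepts 2-byte sequences without
    rejecting overlong forms (lead bytes 0xC0 and 0xC1)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b < 0xE0 and i + 1 < len(data):
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b) if b < 0x80 else "\ufffd")
            i += 1
    return "".join(out)


def looks_safe(raw: bytes) -> bool:
    """Hypothetical security filter that inspects the raw bytes and
    assumes invalid UTF-8 will never be decoded downstream."""
    return b"../" not in raw


payload = b"..\xc0\xaf..\xc0\xafetc"   # invalid UTF-8 per the standard
print(looks_safe(payload))             # the filter sees no "../"
print(weak_decode(payload))            # the lenient decoder produces it anyway
```

The filter is correct with respect to the standard, since a strict decoder rejects the payload outright; the hole only opens because the decoder behind it is more liberal than the standard allows.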
The only good way to handle the situation "liberally" is then to replace all invalid bytes with U+FFFD, and not to attempt to decode them as if they were valid.
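In Python, for example, this behavior is available directly through the standard codec's errors="replace" handler. The wrapper name decode_with_replacement is only an illustrative assumption; the underlying call is the built-in bytes.decode.

```python
def decode_with_replacement(data: bytes) -> str:
    """Strict UTF-8 decoding in which each ill-formed byte sequence is
    replaced by U+FFFD instead of being interpreted as if it were valid."""
    return data.decode("utf-8", errors="replace")


# An ISO-8859-1 "école" mislabelled as UTF-8: only the bad byte is replaced.
print(decode_with_replacement(b"\xe9cole"))
# An overlong "/" (C0 AF): both bytes are invalid and both are replaced.
print(decode_with_replacement(b"..\xc0\xaf"))
```

The page still displays, the damage stays visible and local, and no invalid sequence is ever turned into a character that the author did not encode.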