From: Philippe Verdy (firstname.lastname@example.org)
Date: Mon May 15 2006 - 10:22:00 CDT
From: "Doug Ewell" <email@example.com>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>> This suggestion won't work. The security problem is in the browser,
>> not in the data itself which was created on purpose to break the UTF-8
>> Those attempting to use this problem will generate broken UTF-8 (for
>> example and notably to bypass email filtering against spam, based on
>> keyword detections)
>> If the filter is designed to detect specific words, and validates its
>> input before treating it, it will not detect the forbidden characters
>> or keywords, and the content will pass OK through these filters.
>> Then the content will be rendered using UTF-8 despite it should have
>> been blocked by input filters.
> Thus the statement I made earlier is proven true: people will find a way
> to criticize Microsoft regardless of what they do.
> Shawn Steele already said the IE team is investigating this situation.
I did not criticize Microsoft; it was not even mentioned in my message. I gave other arguments against the liberal interpretation of UTF-8, regardless of which software uses such a decoder. But if input filters must now use a liberal decoder to detect characters hidden in those invalid sequences, this will just complicate the work for everybody.
We'll soon see (maybe this has already occurred) spam for C*lis and V*gra drugs using this invalid UTF-8 encoding just to get into our mailboxes. Spammers repeatedly try various encoding tricks to get their messages past antispam filters. As long as the content displays successfully in IE, that is good enough for them, and they will use this trick!
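To make the bypass concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the "liberal" decoder is modelled as one that maps each invalid byte to its ISO 8859-1 character (the actual behaviour of any given browser may differ), and the two filter functions, the keyword list and the sample message are hypothetical.

```python
import codecs

def _latin1_fallback(err):
    # Assumed "liberal" behaviour: reinterpret each invalid byte as ISO 8859-1.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end

# Register the handler under a made-up name so bytes.decode() can use it.
codecs.register_error("latin1-fallback", _latin1_fallback)

KEYWORDS = ("forbidden",)  # hypothetical blocklist

def strict_filter_flags(data: bytes) -> bool:
    """Hypothetical filter that only scans input that is valid UTF-8."""
    try:
        text = data.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return False  # nothing decoded, so nothing is matched
    return any(k in text.lower() for k in KEYWORDS)

def liberal_filter_flags(data: bytes) -> bool:
    """Hypothetical filter that decodes the way the liberal renderer does."""
    text = data.decode("utf-8", errors="latin1-fallback")
    return any(k in text.lower() for k in KEYWORDS)

# One stray ISO 8859-1 byte (0xE9) makes the whole message invalid UTF-8.
message = b"caf\xe9 special: forbidden pills"
```

With this message, `strict_filter_flags(message)` returns False while the liberal decoder still yields readable text containing the keyword, which is exactly the gap a spammer would exploit.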
Now, if antispam filters must be updated to use liberal UTF-8 decoding, this will create more false positives for emails that are not spam and were intended to be rendered not as UTF-8 but with an ISO 8859 charset.
So for me, the suggestion is that Microsoft:
* fixes the UTF-8 decoder,
* and possibly documents another explicit MIME charset name for the liberal decoder. This "weak-UTF-8" charset would be used by those who have large databases or text corpora that have not been completely converted to strict UTF-8: they could continue to publish their content on the web just by changing the MIME charset sent by the server. This would be a temporary solution until they have reencoded their database, or adapted their server-side software to reencode the text read from the database to strict UTF-8.
* and possibly integrates a way for the browser's users to select the liberal decoding mode manually (but never automatically with the autodetection mechanism of IE)
This suggestion would avoid having to implement the liberal mode in security filters, notably antispam filters, and would avoid new false positives that would cause new trouble for all users.
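The three points above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: "weak-UTF-8" is the charset name proposed in this message, not a registered one; the liberal mode is again modelled as an ISO 8859-1 fallback for invalid bytes; and `decode_body`, its parameters and the autodetection rule are hypothetical.

```python
import codecs

def _latin1_fallback(err):
    # Assumed liberal behaviour: map each invalid byte to its ISO 8859-1 character.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return err.object[err.start:err.end].decode("latin-1"), err.end

codecs.register_error("weak-utf8-fallback", _latin1_fallback)

def decode_body(data: bytes, declared_charset=None, user_forced_liberal=False) -> str:
    """Hypothetical dispatcher: liberal decoding only when explicitly requested."""
    label = (declared_charset or "").strip().lower()
    # Liberal mode: explicit "weak-UTF-8" label, or a manual choice by the user.
    if label == "weak-utf-8" or user_forced_liberal:
        return data.decode("utf-8", errors="weak-utf8-fallback")
    # A declared "UTF-8" always means the strict decoder.
    if label in ("utf-8", "utf8"):
        return data.decode("utf-8")
    # Any other declared charset is honoured as-is.
    if label:
        return data.decode(label)
    # Autodetection (simplified): strict UTF-8, else ISO 8859-1 for the whole
    # text. It never selects the liberal mixed mode.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")
```

For example, `decode_body(b"caf\xc3\xa9 na\xefve", "weak-UTF-8")` accepts the mixed encoding, while the same bytes declared as "UTF-8" raise `UnicodeDecodeError`, so a filter using the strict decoder sees exactly what a strict renderer would display.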