Re: Win IE 7b2 and UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 16 2006 - 04:38:44 CDT

    From: "Shawn Steele" <Shawn.Steele@microsoft.com>
    > Presuming that a filter understood that the mail was UTF-8, then it
    > should realize that there're a bunch of FFFDs in the content. IMHO
    > that'd be enough reason to consider filtering it.

    Why would an email filter consider U+FFFD as spam? U+FFFD is not specific to spam and may occur simply because the charset could not be detected properly. Such a failure may happen with charsets not yet supported by the input decoder, with a failure to detect and decode the transport encoding syntax (including compression or UU-encoding), or with an email sent without the normally required header that specifies its format. I make a distinction between spam in which I want to detect certain words, and an email that my automatic email filter can't decode properly but which could still be valid, merely using an alternate format.
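    As a minimal sketch (Python; the variable names are only illustrative), here is how U+FFFD can appear for an entirely innocent reason: a Latin-1 body whose charset header was omitted gets decoded as UTF-8, and the stray byte becomes a replacement character even though the mail is legitimate.

        # Legitimate Latin-1 text, but the sender omitted the charset header.
        body = "café".encode("latin-1")              # b'caf\xe9'

        # A recipient that assumes UTF-8 and substitutes invalid bytes
        # sees U+FFFD, with no spam involved at all.
        decoded = body.decode("utf-8", errors="replace")
        print(decoded)                               # 'caf\ufffd'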

    Given that the absence of the format tag is likely an error by the sender, a message that can't be read using standard syntaxes will not reach its intended audience if it is spam, so spammers would likely not use such an invalid format. Yes, they use various encoding tricks to reach their audience, but they first make sure that the message can be read without special handling by the recipient. If any uncommon handling is needed to read the email, the spammer has already failed from the start. So such an email is likely not spam.

    By contrast, an email that uses the "weak-UTF-8" encoding would reach its audience on Windows, with IE decoding it as intended, while a keyword-based email filter would not properly detect the keywords. (I spoke about email, but this also concerns the web, where child-protection filters rely on keyword detection against websites that do not properly tag their porn sites with rating labels.)

    To detect those cases, the filter would have to implement the "weak-UTF-8" decoder itself. But I think the simpler approach is to completely avoid rendering "weak-UTF-8" as if it were valid UTF-8: if the data does not decode strictly, the renderer renders the invalid sequences as if they were replaced by U+FFFD. Spammers will then know that they are not reaching their audience, and so will not use the "weak-UTF-8" encoding but a complying strict UTF-8 encoding.
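    As a minimal sketch (Python; the keyword and byte values are only illustrative), this shows why strict decoding defeats the trick: an overlong two-byte form of "o" (0xC1 0xAF) would be shown as "o" by a lenient "weak-UTF-8" decoder, but a strict decoder turns it into U+FFFD replacement characters, so the rendered text no longer spells the word the spammer intended.

        # "p" + overlong encoding of "o" (0xC1 0xAF, invalid in strict UTF-8) + "rn"
        data = b"p\xc1\xafrn"

        strict = data.decode("utf-8", errors="replace")
        print(strict)                     # 'p\ufffd\ufffdrn' -- the intended word is not rendered
        print("\ufffd" in strict)         # True: a filter can also flag the replacements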

    > Firewalls and filters should probably immediately suspect any data that
    > they determine is misencoded in any encoding.

    That's not my opinion. Encoding formats are flourishing, and blocking them all would simply prevent users from seeing the new content even once the proper new tools to render it are installed.

    > > * and possibly documents another explicit MIME charset name for the
    > > liberal decoder
    > That won't happen. We certainly don't want to introduce new code pages.
    > Anyone broken by a fixed UTF-8 would have to fix their web site.

    I would only say that they should update their website; but the truth is that what they may need to do is more complex than that: it may require re-encoding large databases that are kept intact for historical or legal archiving reasons (such as book records in a registry, or a corpus of texts). Changing those texts automatically, across the board, may be a dangerous option if it causes data loss, notably when some texts mix compliant UTF-8 with a few minor errors caused by interchange and aggregation of data from different sources.

    So the possibility of distributing the unmodified content as-is, but without a tag saying it is plain UTF-8, seems a valid option when the volume of texts to review is large. Converting those texts to strict UTF-8 would take a long time in such a mixed historical database of texts from various sources.

    That's where I understand the problem of those who, over the past years, have mixed texts from various sources and lost the ability to detect how they were originally encoded. Tagging this data as "weak-UTF-8" instead of strict "UTF-8" (or simply not tagging it at all) would help separate the cases: the autodetectors on the recipient host would look at the data, would only need to detect strict UTF-8, and would then automatically and properly present to the user only those documents that obey the rules. The user would then have to manually retry interpreting a document with a weak-UTF-8 decoder. This should not be automated, so that users remain able to signal to the source that the document is not properly UTF-8 encoded, and maybe the user will be able to make the corrections.
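    As a minimal sketch (Python; the function name is my own, not part of any existing tool), the strict-first classification described above could look like this: documents that validate as strict UTF-8 are presented automatically, and anything else is set aside for manual review with a lenient decoder.

        def classify(raw: bytes) -> str:
            try:
                raw.decode("utf-8")          # strict: rejects overlong forms and bad bytes
                return "strict-utf-8"        # safe to present automatically
            except UnicodeDecodeError:
                return "needs-manual-review" # retry by hand with a lenient decoder

        print(classify("résumé".encode("utf-8")))   # strict-utf-8
        print(classify(b"caf\xc1\xaf"))             # needs-manual-review (overlong form)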


