Re: Yet another reason some software treat your UTF-8 xml as US-ASCII

From: Tim Greenwood
Date: Thu May 06 2004 - 14:18:19 CDT

This situation is closely analogous to the case where an HTTP response is sent with no charset declared, either directly in the Content-Type header or in an HTML META element. RFC 2616 is explicit about this in section 3.7.1:

  " When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. "

However, every browser that I have examined violates this and instead guesses the character set from other information available to it, such as the locale of the machine or an explicit user setting. To my mind the browser manufacturers are correct and the standard is wrong.
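To make the stakes concrete, here is a minimal Python sketch of what a recipient sees if it applies the ISO-8859-1 default to bytes that were really UTF-8. The sample string and the variable names are mine, chosen only for illustration.

```python
# Bytes a server might send for the text "café", encoded as UTF-8,
# with no charset parameter on the Content-Type header.
raw = "caf\u00e9".encode("utf-8")

# A recipient applying the RFC 2616 section 3.7.1 default decodes the
# two UTF-8 bytes of "é" as two separate ISO-8859-1 characters.
as_latin1 = raw.decode("iso-8859-1")

# A recipient that guesses (or is told) UTF-8 recovers the original text.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
```

Decoding as ISO-8859-1 never raises an error, because every byte value maps to some character, so the mistake shows up only as garbled text.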

One thing the RFC does get right, in correcting some earlier deviant behavior of browsers, is section 3.4.1:

"3.4.1 Missing Charset

   Some HTTP/1.0 software has interpreted a Content-Type header without
   charset parameter incorrectly to mean "recipient should guess."
   Senders wishing to defeat this behavior MAY include a charset
   parameter even when the charset is ISO-8859-1 and SHOULD do so when
   it is known that it will not confuse the recipient.

   Unfortunately, some older HTTP/1.0 clients did not deal properly with
   an explicit charset parameter. HTTP/1.1 recipients MUST respect the
   charset label provided by the sender; and those user agents that have
   a provision to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather than the
   recipient's preference, when initially displaying a document. See
   section 3.7.1."

In other words, if a charset is there, do as it says. Here the standard is almost, but not quite, admitting that the previous RFC 2068 was wrong, and the clients correct, about what to do in the absence of a charset parameter. It is a pity that it did not correct the error rather than repeating it in section 3.7.1, but that is of little practical concern since that section is ignored in practice.
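The two behaviors can be put side by side in a short Python sketch. This is only an illustration under my own assumptions, not any browser's actual logic: the function names, the `guess` argument (standing in for a locale or user setting) and the `follow_rfc_default` switch are all hypothetical.

```python
import codecs
from email.message import Message

# Default that RFC 2616 section 3.7.1 assigns to "text" subtypes.
RFC2616_DEFAULT = "iso-8859-1"


def declared_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def is_supported(charset):
    """True if this Python installation has a codec for the charset."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False


def choose_charset(content_type, guess=None, follow_rfc_default=False):
    """Pick a charset for a text response.

    A declared, supported charset always wins (section 3.4.1).
    Otherwise either apply the section 3.7.1 default or fall back to
    the recipient's own guess, which is what browsers do in practice.
    """
    declared = declared_charset(content_type)
    if declared and is_supported(declared):
        return declared
    if follow_rfc_default or guess is None:
        return RFC2616_DEFAULT
    return guess


print(choose_charset("text/html; charset=UTF-8", guess="windows-1252"))
print(choose_charset("text/html", guess="windows-1252"))
print(choose_charset("text/html", guess="windows-1252", follow_rfc_default=True))
```

The first call honors the label, the second behaves like the browsers I have examined, and the third is what the letter of section 3.7.1 asks for.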

- Tim
