Yet another reason some software treats your UTF-8 XML as US-ASCII

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Thu May 06 2004 - 10:34:47 CDT


Surely no one on this mailing list wants to see their XML treated as US-ASCII when the data is really in UTF-8.

If I have an XML file like the following:

<?xml version="1.0"?>
....


and send it over HTTP with the following Content-Type header:

Content-Type: text/xml;

(without the charset=UTF-8)

Guess which charset the receiver should use for the XML:
UTF-8? ISO-8859-1? Or US-ASCII?

If you only read the XML 1.0 specification, you will probably conclude that it should be treated as "UTF-8". However, if you also read RFC 3023, then ... the answer is "US-ASCII".

see http://www.faqs.org/rfcs/rfc3023.html

[...]
3.1 Text/xml Registration
[....]
Conformant with [RFC2046], if a text/xml entity is received with
the charset parameter omitted, MIME processors and XML processors
MUST use the default charset value of "us-ascii"[ASCII]. In cases
where the XML MIME entity is transmitted via HTTP, the default
charset value is still "us-ascii".
[....]

:(

Notice that if the type is application/xml, the rule changes!
3.2 Application/xml Registration
[...]
If an application/xml entity is received where the charset
parameter is omitted, no information is being provided about the
charset by the MIME Content-Type header. Conforming XML
processors MUST follow the requirements in section 4.3.3 of [XML]
that directly address this contingency. However, MIME processors
that are not XML processors SHOULD NOT assume a default charset if
the charset parameter is omitted from an application/xml entity.
[...]

:( :( :(
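
To make the two rules quoted above concrete, here is a rough sketch in Python of how a receiver might pick the charset from the Content-Type header alone. The function names (split_content_type, effective_charset) are made up for illustration and do not come from any library. The sketch assumes that an explicit charset parameter wins when present, and it only handles the two media types discussed in this message.

```python
def split_content_type(header_value):
    """Split a Content-Type value into (media type, charset or None)."""
    media_type, _, rest = header_value.partition(";")
    media_type = media_type.strip().lower()
    charset = None
    for param in rest.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # Charset names are case-insensitive, so normalize to lowercase.
            charset = value.strip().strip('"').lower()
    return media_type, charset


def effective_charset(header_value):
    """Return the charset implied by the header, or None if undetermined.

    Hypothetical helper: None means "the header says nothing, so an XML
    processor falls back to XML 1.0 section 4.3.3 (BOM / encoding
    declaration, otherwise UTF-8)".
    """
    media_type, charset = split_content_type(header_value)
    if charset is not None:
        # Assumption of this sketch: an explicit charset parameter wins.
        return charset
    if media_type == "text/xml":
        # RFC 3023 section 3.1: omitted charset means us-ascii, even over HTTP.
        return "us-ascii"
    if media_type == "application/xml":
        # RFC 3023 section 3.2: the header provides no charset information.
        return None
    raise ValueError("media type not covered by this sketch: " + media_type)


if __name__ == "__main__":
    for value in ("text/xml;", "text/xml; charset=UTF-8", "application/xml"):
        print(repr(value), "->", effective_charset(value))
```

So the header from my example, "text/xml;" with no charset, comes out as us-ascii, while the same document sent as application/xml leaves the decision to the XML processor.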



