Re: UTF-8 to UTF-16LE

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 08 2003 - 11:20:04 EDT

    On Tuesday, July 08, 2003 4:17 PM, John Cowan <jcowan@reutershealth.com> wrote:
    > XML parsers MUST support UTF-16, with a BOM and in either order, and
    > UTF-8. All other encodings MUST be properly declared.
    > (Bogusly IMHO, an HTTP Content-Type: header overrides this rule.)

    Not bogus: the HTTP header is less important than an explicit declaration in the XML document.

    So if the header says:
        Content-Type: text/xml; charset=iso-8859-1
    or:
        Content-Type: text/xml
    and there's a declaration like:
        <?xml version="1.0" encoding="UTF-16"?>
    then the document is encoded in UTF-16, whatever the HTTP header specifies or omits.

    The default UTF-8/UTF-16 only applies to the case where there is
    *neither* an XML declaration *nor* an external meta-data declaration
    such as HTTP headers.
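
    A minimal sketch of the precedence described above, in Python. The function names are hypothetical, and the sketch assumes the text of the XML declaration has already been recovered as a string; it does not show how the bytes are read. A return value of None stands for "fall back to the UTF-8/UTF-16 default".

```python
import re
from typing import Optional

# Hypothetical helpers: the names and regexes are illustrative only.
_DECL = re.compile(
    r'<\?xml\b[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)
_CHARSET = re.compile(r'charset\s*=\s*"?([^";\s]+)"?', re.IGNORECASE)


def declared_encoding(xml_decl: Optional[str]) -> Optional[str]:
    """Return the encoding named in the XML declaration text, if any."""
    if not xml_decl:
        return None
    match = _DECL.match(xml_decl)
    return match.group(2) if match else None


def header_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    match = _CHARSET.search(content_type)
    return match.group(1) if match else None


def effective_encoding(content_type: Optional[str],
                       xml_decl: Optional[str]) -> Optional[str]:
    """Precedence argued in this message: declaration, then header, then default."""
    return declared_encoding(xml_decl) or header_charset(content_type)


decl = '<?xml version="1.0" encoding="UTF-16"?>'
print(effective_encoding("text/xml; charset=iso-8859-1", decl))
print(effective_encoding("text/xml", decl))
```

    With the two headers shown earlier, both calls report the encoding named in the declaration.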

    UTF-16BE and UTF-16LE are not suitable as defaults for XML without an explicit declaration.
    However, the BOM may be omitted from the "UTF-16" encoding scheme, and in that case it MUST be decoded only as UTF-16BE.
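
    The BOM rule for data labelled "UTF-16" could be sketched as follows, in Python. The function name decode_utf16_label is hypothetical; the only behaviour assumed is the one stated above: a BOM selects the byte order, and a missing BOM means big-endian.

```python
import codecs


def decode_utf16_label(data: bytes) -> str:
    """Decode bytes labelled "UTF-16": honour a BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # FE FF: big-endian, BOM is not part of the text.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # FF FE: little-endian, BOM is not part of the text.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: decoded only as UTF-16BE, per the rule stated above.
    return data.decode("utf-16-be")


print(decode_utf16_label(b"\xfe\xff\x00<\x00a"))
print(decode_utf16_label(b"\xff\xfe<\x00a\x00"))
print(decode_utf16_label(b"\x00<\x00a"))
```

    The byte order is spelled out explicitly rather than left to Python's generic "utf-16" codec, which falls back to the platform's native byte order when no BOM is present.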

    This means that an XML file starting with 0x3C,0x00 or 0x20,0x00 would be decoded as UTF-8, not as UTF-16LE, unless UTF-16LE is *explicitly* specified in the XML declaration that follows, which must consist only of 7-bit bytes. The XML parsing must fail if the explicit "encoding" attribute is not found in that declaration, or if there's no explicit XML declaration, because NUL characters are not allowed in "text/xml" documents.
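
    One possible reading of this rule, sketched in Python. The function name resolve_nul_prefixed is hypothetical, and tolerating leading spaces before the declaration is an assumption made only because 0x20,0x00 is listed as a possible start. The sketch accepts the stream only when a 7-bit declaration explicitly names UTF-16LE, and fails otherwise.

```python
import re

# Hypothetical helper: the name and the regex are illustrative only.
_ENCODING_ATTR = re.compile(
    r'\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)


def resolve_nul_prefixed(data: bytes) -> str:
    """Return "UTF-16LE" if explicitly declared, else raise ValueError."""
    if data[:2] not in (b"<\x00", b" \x00"):
        raise ValueError("stream does not start with 0x3C,0x00 or 0x20,0x00")

    # Take everything up to and including the first '>' code unit.
    end = data.find(b">\x00")
    head = data[:end + 2] if end != -1 else b""

    # The declaration must be 7-bit only, with every high byte equal to zero.
    if not head or any(b >= 0x80 for b in head) or any(head[1::2]):
        raise ValueError("no usable 7-bit XML declaration; NUL not allowed")

    # Assumption: leading spaces are skipped before looking for "<?xml".
    text = head[0::2].decode("ascii").lstrip(" ")
    match = _ENCODING_ATTR.search(text) if text.startswith("<?xml") else None
    if match is None or match.group(2).upper() != "UTF-16LE":
        raise ValueError("UTF-16LE not explicitly declared; NUL not allowed")
    return "UTF-16LE"


doc = '<?xml version="1.0" encoding="UTF-16LE"?><a/>'.encode("utf-16-le")
print(resolve_nul_prefixed(doc))
```

    A stream with the same byte pattern but without the "encoding" attribute, or without any XML declaration, raises ValueError in this sketch, which corresponds to the parsing failure described above.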

    -- Philippe.


