Re: UTF-8 to UTF-16LE

From: John Cowan (jcowan@reutershealth.com)
Date: Tue Jul 08 2003 - 12:28:57 EDT


    Philippe Verdy scripsit:

    > Not bogus: the HTTP header is less important than an explicit
    > declaration in the XML document.

    You've misread me or RFC 3023 or both. The charset parameter in the MIME
    header *overrides* the encoding declaration in the XML content. If the
    header says "ISO 8859-1", then the character encoding of the contents is
    ISO 8859-1, no matter what the encoding declaration says or doesn't say.

    What is even worse is that if the media type is text/xml (as opposed to
    application/xml), and the charset parameter is not specified, the
    character encoding of the contents is US-ASCII, again no matter what
    the encoding declaration says or doesn't say.
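
    The precedence described above can be sketched in Python. This is only
    an illustration under stated assumptions: the function name
    effective_xml_encoding is hypothetical, and the fallback for
    application/xml is a simplification (byte order mark, then an
    ASCII-readable encoding declaration, then UTF-8), not the full
    detection procedure from the XML specification.

```python
import re
from email.message import Message

# Matches an encoding declaration at the very start of an ASCII-compatible
# document. Simplified: it does not cover UTF-16 or EBCDIC encoded declarations.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_xml_encoding(content_type: str, body: bytes) -> str:
    """Return the encoding a receiver should assume (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type

    # 1. A charset parameter in the MIME header overrides everything else.
    charset = msg.get_content_charset()
    if charset is not None:
        return charset

    # 2. text/xml with no charset parameter is US-ASCII, whatever the
    #    encoding declaration in the content says.
    if msg.get_content_type() == "text/xml":
        return "us-ascii"

    # 3. Otherwise (e.g. application/xml) look at the content itself.
    if body.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"
```

    With a document that declares encoding="UTF-8", a header of
    "text/xml; charset=ISO-8859-1" gives iso-8859-1, a bare "text/xml"
    gives us-ascii, and only "application/xml" lets the declaration count.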

    > The default UTF-8/UTF-16 only applies to the case where there is
    > *neither* an XML declaration, *nor* an external meta-data declaration
    > such as HTTP headers.

    Correct.

    > However the BOM may be omitted from the "UTF-16" encoding scheme,
    > and in that case it MUST be decoded only as UTF-16BE.

    Actually, RFC 2781 says "SHOULD" in that case, not "MUST". I agree that this
    should (or even must) be strengthened in future.
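
    As a sketch of the decoding rule under discussion, assuming the data is
    labelled "UTF-16" and following the RFC 2781 recommendation to treat
    BOM-less text as big-endian (the function name decode_utf16 is
    hypothetical):

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode bytes labelled "UTF-16", defaulting to big-endian without a BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: big-endian, strip the BOM
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: little-endian, strip the BOM
        return data[2:].decode("utf-16-le")
    # No BOM: RFC 2781 says the text SHOULD be interpreted as big-endian.
    return data.decode("utf-16-be")
```

    The explicit fallback matters because a general-purpose "utf-16" decoder
    may pick a different byte order when no BOM is present.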

    -- 
    John Cowan  jcowan@reutershealth.com  www.ccil.org/~cowan  www.reutershealth.com
    I must confess that I have very little notion of what [s. 4 of the British
    Trade Marks Act, 1938] is intended to convey, and particularly the sentence
    of 253 words, as I make them, which constitutes sub-section 1.  I doubt if
    the entire statute book could be successfully searched for a sentence of
    equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940
    

