Re: UTF-8 to UTF-16LE

From: John Cowan (jcowan@reutershealth.com)
Date: Tue Jul 08 2003 - 10:40:51 EDT

    Philippe Verdy scripsit:

    > - UTF-32: with a recommended byte order mark (00,00,FE,FF or FF,FE,00,00)

    UTF-32 requires an XML declaration (always assuming there is no MIME header
    in scope), even though it is easy to autodetect.

    > With UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, the encoding scheme can
    > be ambiguous with legal UTF-8!

    In fact no, because all of these schemes require a 0x00 byte somewhere
    in the first four bytes (because the first character in an XML document
    must be less than U+00FF, specifically either < or whitespace), and
    that represents U+0000 in UTF-8, a character which cannot occur in
    well-formed XML. No ambiguity is possible, but the XML Rec makes this
    a well-formedness error anyway.
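
    As a rough illustration of that autodetection argument, here is a
    minimal Python sketch. The function name sniff_xml_encoding and its
    return labels are hypothetical, not taken from the XML Rec or any
    library. It assumes the document starts with an optional byte order
    mark followed by < or whitespace, and it ignores EBCDIC and other
    non-ASCII-compatible encodings.

```python
def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes.

    Hypothetical helper: assumes the first character is '<' or whitespace
    (so below U+0100) and that U+0000 never occurs in the document.
    """
    b = head[:4]

    # Byte order marks first. The UTF-32LE mark must be tested before the
    # UTF-16LE mark because both begin with FF FE.
    if b == b"\x00\x00\xfe\xff":
        return "UTF-32BE"
    if b == b"\xff\xfe\x00\x00":
        return "UTF-32LE"
    if b[:3] == b"\xef\xbb\xbf":
        return "UTF-8"
    if b[:2] == b"\xfe\xff":
        return "UTF-16BE"
    if b[:2] == b"\xff\xfe":
        return "UTF-16LE"

    if len(b) < 4:
        return "unknown"

    # No mark: the position of the 0x00 bytes around the first character
    # identifies the scheme. UTF-32 is tested before UTF-16 because
    # 3C 00 00 00 would otherwise look like UTF-16LE '<' followed by U+0000.
    if b[0] == 0 and b[1] == 0 and b[2] == 0 and b[3] != 0:
        return "UTF-32BE"
    if b[0] != 0 and b[1] == 0 and b[2] == 0 and b[3] == 0:
        return "UTF-32LE"
    if b[0] == 0 and b[1] != 0:
        return "UTF-16BE"
    if b[0] != 0 and b[1] == 0:
        return "UTF-16LE"
    if 0 not in b:
        # No 0x00 byte at all: UTF-8 or another ASCII-compatible encoding.
        return "UTF-8-compatible"
    return "unknown"


if __name__ == "__main__":
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(codec, "->", sniff_xml_encoding("<?xml version='1.0'?>".encode(codec)))
```

    A UTF-8 document never reaches any of the UTF-16 or UTF-32 branches,
    since each of them needs a 0x00 byte within the first four bytes.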

    > However the last two planes 0x0F and 0x10 are
    > private, and should not be used in XML,

    It is not inappropriate to use the Private Use planes in XML, provided
    you have an agreement in place with the recipient as to their meaning.
    Not all XML documents are meant to be interchanged blind. Far from it, as
    the private said when he missed the target and hit the gunnery instructor.

    > Most Unicode-compliant software however stores and manages strings directly
    > in their UTF-16 encoding form

    There is plenty of software that uses UTF-8 internally as well.

    -- 
    John Cowan  jcowan@reutershealth.com  www.reutershealth.com  www.ccil.org/~cowan
    I am he that buries his friends alive and drowns them and draws them
    alive again from the water. I came from the end of a bag, but no bag
    went over me.  I am the friend of bears and the guest of eagles. I am
    Ringwinner and Luckwearer; and I am Barrel-rider.  --Bilbo to Smaug
    

