Re: UTF-8 to UTF-16LE

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jul 08 2003 - 09:26:22 EDT


    On Tuesday, July 08, 2003 2:22 PM, Jon Hanna <jon@spin.ie> wrote:

    > > According to XML the default encoding scheme is UTF-8.
    >
    > Not strictly true. The default encoding scheme is UTF-8 *or*
    > UTF-16LE *or* UTF-16BE,

    Also wrong: UTF-16LE and UTF-16BE are not among the default encoding schemes. Only UTF-16 is an acceptable default encoding scheme, because it carries an explicit BOM, unlike UTF-16BE and UTF-16LE, where there is NO byte order mark.

    So the default encoding schemes are really:
    - UTF-8: byte order mark not recommended, but commonly found (EF,BB,BF)
    - UTF-16: with a required byte order mark (FE,FF or FF,FE)
    - UTF-32: with a recommended byte order mark (00,00,FE,FF or FF,FE,00,00)
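
    As a rough illustration of the list above, here is a minimal Python sketch that sniffs a leading byte order mark. The function name sniff_bom and the labels are made up for this example; this is not the detection procedure of any XML parser. Note that FF,FE,00,00 could also be a UTF-16 little-endian BOM followed by U+0000, so testing the longer signature first is an assumption, not a certainty.

```python
# BOM signatures, longest first so that FF FE 00 00 (UTF-32, little-endian)
# is tested before its prefix FF FE (UTF-16, little-endian).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 (big-endian)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (little-endian)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]


def sniff_bom(data: bytes):
    """Return (label, bom_length), or (None, 0) when no BOM is present.

    Hypothetical helper for illustration only.
    """
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

    For example, sniff_bom(b"\xef\xbb\xbf<a/>") reports the UTF-8 signature and a length of 3 bytes to skip.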

    With UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE, which carry no byte order mark, the encoded bytes can be ambiguous with legal UTF-8!
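
    The ambiguity is easy to reproduce. The sketch below (decodes_as_utf8 is a name invented for this example) relies on Python's "utf-16-be", "utf-16-le", "utf-32-be" and "utf-32-le" codecs, which write no BOM. For ASCII-range text, each of them produces only NUL and ASCII bytes, and that is perfectly legal UTF-8.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True when the bytes form a legal UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "<a/>"

# BOM-less schemes: the output is NUL bytes interleaved with ASCII,
# which a UTF-8 decoder accepts without complaint.
ambiguous = {
    codec: decodes_as_utf8(text.encode(codec))
    for codec in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

# With an explicit BOM the stream starts with FE FF, and the byte 0xFE
# can never occur in UTF-8, so the confusion disappears.
with_bom = decodes_as_utf8(("\ufeff" + text).encode("utf-16-be"))
```

    Here every entry of ambiguous is True while with_bom is False, which is the practical argument for requiring the BOM.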

    Note that for UTF-32, the byte order mark may be ignored, as the position of the leading or trailing null byte in every encoded character determines the byte order. The byte next to it (the plane number) is also always between 0x00 and 0x10. However, the last two planes, 0x0F and 0x10, are private and should not be used in XML, so this plane byte will really be 0x00 most of the time (for characters of the BMP), 0x01 for rare scripts, 0x02 more often for ideographic supplements, and rarely 0x0E for some special characters (like language tags).
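
    A small sketch of that heuristic, assuming the input is a whole number of 4-byte units with any BOM already stripped. The function name and the returned strings are invented here. Some inputs satisfy both layouts (a run of U+0000, for instance), so the function can only answer "ambiguous" for them.

```python
def guess_utf32_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UTF-32 from the null and plane bytes.

    Big-endian units look like    00 pp xx xx  (pp = plane, 0x00..0x10).
    Little-endian units look like xx xx pp 00.
    """
    if len(data) == 0 or len(data) % 4 != 0:
        raise ValueError("expected a non-empty multiple of 4 bytes")
    units = [data[i:i + 4] for i in range(0, len(data), 4)]
    could_be_be = all(u[0] == 0 and u[1] <= 0x10 for u in units)
    could_be_le = all(u[3] == 0 and u[2] <= 0x10 for u in units)
    if could_be_be and not could_be_le:
        return "big-endian"
    if could_be_le and not could_be_be:
        return "little-endian"
    # Either both layouts fit, or neither does.
    return "ambiguous" if could_be_be else "not UTF-32"
```

    In practice a single BMP character with a non-zero low byte, such as "A", is enough to settle the question for the whole stream.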

    UTF-32 can be used safely on networks, but it would probably be compressed (deflated) in transport (for example over HTTP). Most Unicode-compliant software, however, stores and manages strings directly in the UTF-16 encoding form (not one of the three encoding schemes for UTF-16), and uses UTF-32 only for internal intermediate processing of surrogates within the implementation of an API, or to check character properties in lookup tables based on UTF-32 bits.
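
    To get a feeling for what deflating buys on the wire, the following sketch compares raw and compressed sizes for a few encodings. It uses zlib.compress from the Python standard library, which produces the zlib-wrapped deflate format; the helper name encoded_sizes and the sample text are assumptions of this example, and real figures depend entirely on the content.

```python
import zlib


def encoded_sizes(text: str) -> dict:
    """Map each encoding to its raw and deflated size in bytes."""
    sizes = {}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        raw = text.encode(codec)
        sizes[codec] = len(raw)
        sizes[codec + " deflated"] = len(zlib.compress(raw))
    return sizes


# Hypothetical sample: repetitive ASCII text, where UTF-32 is mostly NUL bytes.
sample = "Unicode text sample. " * 50
sizes = encoded_sizes(sample)
```

    On text like this the raw UTF-32 stream is four times the size of the UTF-8 one, but the padding bytes are highly redundant and largely disappear after compression.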

    For storage, the UTF-16 encoding form can be serialized either to a UTF-8 encoding scheme, or sometimes to the SCSU or BOCU compression schemes (BOCU is useful for database indexing and is not contextual; SCSU is a contextual compression scheme, but compresses even further).
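
    The UTF-8 case can be sketched as follows: 16-bit code units are paired into code points (the UTF-32 intermediate step) and then written out as UTF-8 bytes. The function name utf16_units_to_utf8 is invented for this example, and rejecting unpaired surrogates is one possible policy, not the only one. SCSU and BOCU are left out because the Python standard library has no codec for them.

```python
def utf16_units_to_utf8(units) -> bytes:
    """Serialize a sequence of 16-bit code units (ints) to UTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            # Combine the pair into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            # BMP code unit: the code point is the unit itself.
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)
```

    For instance, the units 0041, D840, DC00 stand for "A" followed by U+20000, and come out as one UTF-8 byte plus a four-byte sequence.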


