RE: UTF-8 to UTF-16LE

From: Jon Hanna (jon@spin.ie)
Date: Tue Jul 08 2003 - 11:06:05 EDT


    > On Tuesday, July 08, 2003 2:22 PM, Jon Hanna <jon@spin.ie> wrote:
    >
    > > > According to XML the
    > > > default encoding scheme is UTF-8.
    > >
    > > Not strictly true. The default encoding scheme is UTF-8 *or*
    > > UTF-16LE *or* UTF-16BE,
    >
    > Wrong also: UTF-16LE and UTF-16BE are not among the default encoding
    > schemes. Only UTF-16 is an acceptable default encoding scheme, as
    > it uses an explicit BOM, unlike UTF-16BE and UTF-16LE where
    > there's NO byte order mark.

    An error in my expression rather than my point. I wanted to make clear that
    there is a requirement to detect whether the UTF-16 is big or little endian.
    Of course stating it in terms of UTF-16BE or UTF-16LE was completely
    wrong of me. I abbreviated to BE and LE as I typed. I apologise to Santhosh,
    as there's nothing worse than being misinformed in response to a question
    when you're trying to get work done based on it.

    UTF-16BE and UTF-16LE (that is, without a BOM) can only be used if they are
    declared.
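
    As a rough illustration of the difference (a sketch using Python's built-in
    codecs, not anything taken from the XML specification): the "utf-16" codec
    writes a BOM, while "utf-16-be" writes none, so there the byte order is only
    known from the label. The sample documents and the helper name
    starts_with_utf16_bom are made up for the example.

```python
# Illustrative only: both sample documents are invented for this sketch.
WITH_BOM = '<?xml version="1.0"?><a/>'.encode("utf-16")
WITHOUT_BOM = '<?xml version="1.0" encoding="UTF-16BE"?><a/>'.encode("utf-16-be")


def starts_with_utf16_bom(data: bytes) -> bool:
    """Return True if the data begins with FE,FF or FF,FE."""
    return data[:2] in (b"\xfe\xff", b"\xff\xfe")


# The "utf-16" codec emits a BOM in the platform's byte order;
# "utf-16-be" emits none, so the declaration has to name the encoding.
print(starts_with_utf16_bom(WITH_BOM))
print(starts_with_utf16_bom(WITHOUT_BOM))
```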

    > So the default encoding schemes are really:
    > - UTF-8: byte order mark not recommended, but this is commonly
    > found (EF,BB,BF)

    The BOM is not recommended, but it's not NOT RECOMMENDED in the RFC sense
    either. IIRC there was talk of either prohibiting it or making it NOT
    RECOMMENDED at some point, but this was decided against.

    > - UTF-16: with a required byte order mark (FE,FF or FF,FE)
    > - UTF-32: with a recommended byte order mark (00,00,FE,FF or FF,FE,00,00)

    There is no requirement to support UTF-32. It's worth noting that since it
    is an error for UTF-32 encoded XML not to begin with an XML declaration, the
    first character will be U+003C. This will be either the 5th through 8th
    octets (if there is a BOM) or the first 4 octets, hence 00,00,00,3C and
    3C,00,00,00 identify UTF-32 without a BOM (feasibly 00,00,3C,00 and
    00,3C,00,00 as well, for unusual octet orders).

    Similarly 00,3C,00,3F would begin UTF-16BE or UCS-2 big-endian and
    3C,00,3F,00 would begin UTF-16LE or UCS-2 little-endian (since the first 2
    characters must be <?). You'd have to read the declaration to know which,
    but the declaration doesn't use any character for which those encodings
    differ, so this at least gives one enough information to decode the
    declaration.
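
    The octet patterns above can be put into a small lookup. This is only a
    sketch: sniff_xml_encoding and its return labels are hypothetical names, not
    part of any library, and falling back to UTF-8 when nothing matches is an
    assumption of the example rather than a full implementation of the
    detection rules.

```python
# Hypothetical helper: guesses an encoding family from the first octets.
# BOMs are checked longest first, since FF,FE,00,00 also starts with FF,FE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32, big-endian (BOM)"),
    (b"\xff\xfe\x00\x00", "UTF-32, little-endian (BOM)"),
    (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
    (b"\xfe\xff", "UTF-16, big-endian (BOM)"),
    (b"\xff\xfe", "UTF-16, little-endian (BOM)"),
]

# Without a BOM the document must start with "<?", so U+003C (and U+003F
# for the 16-bit forms) fixes the layout of the first four octets.
NO_BOM = {
    b"\x00\x00\x00\x3c": "UTF-32, big-endian (no BOM)",
    b"\x3c\x00\x00\x00": "UTF-32, little-endian (no BOM)",
    b"\x00\x00\x3c\x00": "UTF-32, unusual octet order (no BOM)",
    b"\x00\x3c\x00\x00": "UTF-32, unusual octet order (no BOM)",
    b"\x00\x3c\x00\x3f": "UTF-16BE or UCS-2 big-endian (no BOM)",
    b"\x3c\x00\x3f\x00": "UTF-16LE or UCS-2 little-endian (no BOM)",
}


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first few octets of an XML entity."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    # Assumption of this sketch: anything unrecognised is treated as UTF-8.
    return NO_BOM.get(data[:4], "UTF-8 (assumed)")


print(sniff_xml_encoding('<?xml version="1.0"?>'.encode("utf-16-le")))
print(sniff_xml_encoding('<?xml version="1.0"?>'.encode("utf-32-be")))
```

    The result is only enough to decode the declaration; the declared encoding
    name still has to be read to tell, say, UTF-16LE from UCS-2.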

    > With UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, the encoding scheme
    > can be ambiguous with legal UTF-8!

    I can't think of an example of legal UTF-8 XML that can be confused with
    legal examples of any of the others in XML where the difference won't be
    detected in 4 or fewer octets. As far as I can see, all of the encodings
    other than UTF-8 have octets that would be decoded under UTF-8 to U+0000
    (not allowed in XML), octets that are illegal in UTF-8 (in particular 0xFE
    and 0xFF), or both.
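
    A quick way to check this claim for the start of a document, as a sketch
    only: take the first four octets of a declaration in each UTF-16 and UTF-32
    form, with and without a BOM, and see whether they survive being read as
    UTF-8. The names SAMPLE, could_be_utf8_xml and first_four_octets are
    invented for the example, and the check only looks at these short prefixes.

```python
SAMPLE = '<?xml version="1.0"?>'  # made-up sample declaration


def could_be_utf8_xml(prefix: bytes) -> bool:
    """True if the prefix is valid UTF-8 and decodes without any U+0000."""
    try:
        text = prefix.decode("utf-8")
    except UnicodeDecodeError:
        return False  # e.g. 0xFE or 0xFF, which never occur in UTF-8
    return "\x00" not in text  # U+0000 is not allowed in XML


def first_four_octets() -> dict:
    """First four octets of SAMPLE in each non-UTF-8 form, with and without BOM."""
    prefixes = {}
    for codec in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        body = SAMPLE.encode(codec)
        bom = "\ufeff".encode(codec)
        prefixes[codec] = body[:4]
        prefixes[codec + " + BOM"] = (bom + body)[:4]
    return prefixes


for name, prefix in first_four_octets().items():
    print(name, prefix.hex(","), could_be_utf8_xml(prefix))
```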

    > Note that for UTF-32, the byte order mark may be ignored, as the
    > position of the leading or trailing null byte in all encoded
    > characters determines the byte order. The second byte beside it
    > is also always between 0x00 and 0x10. However the last two planes
    > 0x0F and 0x10 are private, and should not be used in XML,

    And cannot in the first few characters (legally), since these must be
    "<?xml ".


