Re: Default endianness of Unicode, or not

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Apr 14 2002 - 18:28:40 EDT


Mark Davis <mark@macchiato.com> wrote:

> Part of the problem is that the term "UTF-16" means two different
> things. Let me see if I can make it clearer.
>
> Let "UTF-16M" refer to the in-memory form, which is a sequence of 16-
> bit code units. The byte ordering is logically immaterial, since it
> is not a sequence of bytes. Such a sequence does not use a BOM. The
> code point sequence <U+1234 U+0061 U+10000> is represented as the
> UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00>.
>
> Let "UTF-16", on the other hand, refer to only the byte-serialized
> form.

I think I understand the difference between the CEF called "UTF-16" and
the CES called "UTF-16." That isn't where I'm having a problem.
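To pin down the encoding-form side of that distinction, here is a minimal Python sketch that lists the 16-bit code units for Mark's example code points. The helper name utf16_code_units is my own, not anything from the standard. It goes through a byte encoding internally only because that is the easiest route in Python; the result is a list of integers with no byte order attached.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers.

    Hypothetical helper for illustration. The bytes are only an
    intermediate step: we encode with a fixed byte order and unpack
    with the same order, so the result carries no byte order itself.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# The code point sequence <U+1234 U+0061 U+10000> from the quoted message.
units = utf16_code_units("\u1234\u0061\U00010000")
print([hex(u) for u in units])  # ['0x1234', '0x61', '0xd800', '0xdc00']
```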

> The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
> as one of:
> <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB

*This* is where I'm having a problem. Mark states here, again, that
BOM-less UTF-16 (the CES) must be big-endian. That is:

<0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOBless

is not an instance of any valid CES. That, to me, is a change from what
Unicode has stated before, and from what Ken just said about using
"other information" (which could include external tagging, knowledge of
the originating platform, or heuristics) to determine the intended byte
order.
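The four byte sequences under discussion can be reproduced with a short Python sketch. The decode_utf16 function and its hint parameter are my own illustration of the "other information" approach, not a conformance statement: it uses the BOM when one is present and otherwise falls back to whatever byte order the caller supplies, with big-endian as the assumed default.

```python
import codecs

text = "\u1234\u0061\U00010000"

be = text.encode("utf-16-be")        # BOMless, big-endian
le = text.encode("utf-16-le")        # "MOBless", little-endian
be_bom = codecs.BOM_UTF16_BE + be    # FE FF prefix
le_bom = codecs.BOM_UTF16_LE + le    # FF FE prefix

def decode_utf16(data, hint="utf-16-be"):
    """Decode UTF-16 bytes, preferring a BOM over the caller's hint.

    `hint` stands in for external tagging, platform knowledge, or a
    heuristic; it is an assumption of this sketch.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode(hint)

print(be.hex())  # 12340061d800dc00
print(le.hex())  # 3412610000d800dc
print(decode_utf16(le, hint="utf-16-le") == text)  # True
print(decode_utf16(le) == text)                    # False
```

The last line is the crux: with no BOM and no outside hint, a decoder that assumes big-endian reads the little-endian bytes without error but produces the wrong text.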

Remember, I like the BOM. I happen to think it's a useful indicator of
both file type and byte order (not really two different topics). But I
do think the official deprecation, or omission from mention, of BOM-less
little-endian UTF-16 is a change from past definitions that renders
nonconformant a potentially large amount of existing UTF-16 data.

-Doug Ewell
 Fullerton, California


