Re: BOM ambiguity? from Ken Whistler on 2012-07-13 (Unicode Mail List Archive)

From: Ken Whistler <kenw_at_sybase.com>
Date: Fri, 13 Jul 2012 15:43:13 -0700

On 7/13/2012 1:54 PM, Stephan Stiller wrote:
> So there is a BOM-ambiguity when a file starts with
> FF FE
> and then a couple of U+0000 characters, yes? Because this could be
> either UTF-16 or UTF-32 under little-endianness. Has this been pointed
> out and discussed beforehand?
>
>

No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
in your question concerning the meaning of "a file" and its contents.

If "a file" is a byte stream interpreted as an LE Unicode 16-bit string,
then:

FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+0000, U+0482, U+0001>

If "a file" is a byte stream interpreted as an LE Unicode 32-bit string,
then:

FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+10482>

If "a file" is a byte stream interpreted as an ISO 8859-1 string, then:

FF FE 00 00 82 04 01 00 ... --> <y-diaeresis, thorn, nul, nul, bph, eot,
soh, nul>

If "a file" is a byte stream interpreted as a packed sequence of C
strings, then:

FF FE 00 00 82 04 01 00 ... --> <0xFF,0xFE>, <>, <0x82,0x04,0x01> ...

If "a file" is a byte stream interpreted as some other binary format, then:

FF FE 00 00 82 04 01 00 ... --> could be anything, maybe part of a
picture of a cat

And, of course, if you tried to interpret that byte stream as either
big-endian UTF-16 or big-endian UTF-32, you would get ill-formed sequences.
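
The first four readings above can be reproduced with a few lines of
Python. This is only an illustrative sketch: the helper name code_points
and the variable names are made up for the example, and splitting on NUL
bytes is just one simple way to model "a packed sequence of C strings".

```python
# The eight example bytes from the message.
data = bytes.fromhex("FF FE 00 00 82 04 01 00")

def code_points(text):
    """Format each character of text as U+XXXX (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The "-le" codecs decode exactly what is there and do not strip a BOM.
utf16 = code_points(data.decode("utf-16-le"))
utf32 = code_points(data.decode("utf-32-le"))

# ISO 8859-1 maps every byte to the code point with the same value.
latin1 = code_points(data.decode("latin-1"))

# Packed C strings: NUL-terminated runs; drop the empty tail after the
# final terminator.
c_strings = data.split(b"\x00")[:-1]

print(utf16)      # ['U+FEFF', 'U+0000', 'U+0482', 'U+0001']
print(utf32)      # ['U+FEFF', 'U+10482']
print(latin1)
print(c_strings)  # [b'\xff\xfe', b'', b'\x82\x04\x01']
```

The same eight bytes decode cleanly under each of these interpretations,
which is the point: the bytes alone do not say which one is intended.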

The only "problem" here is if you don't *know* what the data is, and try to
guess it by examining only the first four bytes. A heuristic that does that
is just broken. But any halfway decent heuristic can easily distinguish, say,
UTF-16 and UTF-32 data (in either byte order) with good reliability after
examining only a short stretch of otherwise unidentified candidate data.
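
As a rough illustration of such a heuristic (not one taken from this
thread), the sketch below strictly decodes a longer sample under each
candidate encoding and accepts the first one that is well-formed and
starts with U+FEFF. The function name guess_bom_encoding and the
candidate order are assumptions made for the example. UTF-32 is tried
first because it is far more restrictive: genuine UTF-16 text almost
never survives being read in 32-bit units.

```python
def guess_bom_encoding(sample):
    """Guess among UTF-16/UTF-32 LE/BE for data that begins with a BOM.

    Hypothetical helper: returns a Python codec name, or None if no
    candidate decodes cleanly with a leading U+FEFF.
    """
    for enc in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be"):
        width = 4 if "32" in enc else 2
        # Ignore a trailing partial code unit in the sample.
        usable = sample[: len(sample) - len(sample) % width]
        try:
            text = usable.decode(enc)  # strict: ill-formed data raises
        except UnicodeDecodeError:
            continue
        if text.startswith("\ufeff"):
            return enc
    return None

# UTF-16LE data that really does begin FF FE 00 00 ...
utf16_sample = "\ufeff\x00\x00Hello".encode("utf-16-le")
# UTF-32LE data that begins with the same four bytes.
utf32_sample = "\ufeff\U00010482 hi".encode("utf-32-le")

print(guess_bom_encoding(utf16_sample))  # utf-16-le
print(guess_bom_encoding(utf32_sample))  # utf-32-le
```

Given only the eight bytes from the examples above, this sketch answers
"utf-32-le", since both little-endian readings are well-formed at that
length. A few more code units of real text are usually enough to rule
one of them out.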

--Ken
Received on Fri Jul 13 2012 - 17:45:46 CDT
