Re: BOM ambiguity? from Philippe Verdy on 2012-07-13 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 14 Jul 2012 01:18:16 +0200

Just eliminate the cases where you find U+0000. In plain-text files
nulls are not useful. If you're trying to guess which encoding is used
in an HTML or XML file, you won't find any null (because they are
invalid in those formats, in all encodings, even ISO-8859-*). Under
those conditions, there's only one way to decode your example stream,
and it is enough to check just the first 4 bytes.
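
As a minimal sketch of that check in Python (the function name
guess_bom_encoding is made up for illustration, and the whole thing assumes
the text contains no U+0000, as in HTML or XML), the UTF-32 signatures are
tested before the UTF-16 ones because FF FE 00 00 begins with FF FE:

```python
import codecs
from typing import Optional


def guess_bom_encoding(head: bytes) -> Optional[str]:
    """Guess an encoding from the first bytes of a stream.

    Assumption (not true of arbitrary data): the text never contains
    U+0000, so FF FE 00 00 cannot be a UTF-16LE BOM followed by a null.
    """
    # UTF-32 first: its little-endian BOM starts with the UTF-16LE BOM.
    if head.startswith(codecs.BOM_UTF32_LE):   # FF FE 00 00
        return "utf-32-le"
    if head.startswith(codecs.BOM_UTF32_BE):   # 00 00 FE FF
        return "utf-32-be"
    if head.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if head.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None  # no BOM: the caller needs some other heuristic


example = bytes.fromhex("FF FE 00 00 82 04 01 00")
guess = guess_bom_encoding(example[:4])
```

Without the no-null assumption the same four bytes stay ambiguous, so this
is only reasonable for formats that forbid U+0000.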

The remaining ambiguity is between UTF-8 and ISO-8859-* (but again, no HTML
or XML file can start with a y-diaeresis under the ISO-8859-1 hypothesis, or
with the characters assigned to the same bytes in the other ISO 8859 parts:
the ambiguity persists in arbitrary plain-text files, but not in HTML and
XML documents).

2012/7/14 Ken Whistler <kenw_at_sybase.com>:
> On 7/13/2012 1:54 PM, Stephan Stiller wrote:
>>
>> So there is a BOM-ambiguity when a file starts with
>> FF FE
>> and then a couple of U+0000 characters, yes? Because this could be either
>> UTF-16 or UTF-32 under little-endianness. Has this been pointed out and
>> discussed beforehand?
>>
>>
>
> No, there is not a "BOM-ambiguity". Rather, there is an English ambiguity
> in your question concerning the meaning of "a file" and its contents.
>
> If "a file" is a byte stream interpreted as an LE Unicode 16-bit string,
> then:
>
> FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+0000, U+0482, U+0001>
>
> If "a file" is a byte stream interpreted as an LE Unicode 32-bit string,
> then:
>
> FF FE 00 00 82 04 01 00 ... --> <U+FEFF, U+10482>
>
> If "a file" is a byte stream interpreted as an ISO 8859-1 string, then:
>
> FF FE 00 00 82 04 01 00 ... --> <y-diaeresis, thorn, nul, nul, bph, eot,
> soh, nul>
>
> If "a file" is a byte stream interpreted as a packed sequence of C strings,
> then:
>
> FF FE 00 00 82 04 01 00 ... --> <0xFF,0xFE>, <>, <0x82,0x04,0x01> ...
>
> If "a file" is a byte stream interpreted as some other binary format, then:
>
> FF FE 00 00 82 04 01 00 ... --> could be anything, maybe part of a picture
> of a cat
>
> And, of course, if you tried to interpret that byte stream as either
> big-endian UTF-16 or big-endian UTF-32, you would get ill-formed sequences.
>
> The only "problem" here is if you don't *know* what the data is, and try to
> guess it by examining only the first four bytes. A heuristic that does that
> is just broken. But any halfway decent heuristic can easily distinguish, say,
> UTF-16 and UTF-32 data (in either byte order) with good reliability after
> examining only a short stretch of otherwise unidentified candidate data.
>
> --Ken
>
>
>
>
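
The interpretations Ken lists can be reproduced with a short Python sketch.
The helper code_points and the variable names are only illustrative. The
explicit "-le" codecs are used so that Python keeps U+FEFF in the output
instead of consuming it as a signature:

```python
# The eight example bytes from the quoted message.
data = bytes.fromhex("FF FE 00 00 82 04 01 00")


def code_points(text):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Little-endian 16-bit code units: four code points, one of them U+0000.
as_utf16le = code_points(data.decode("utf-16-le"))

# Little-endian 32-bit code units: two code points, no null.
as_utf32le = code_points(data.decode("utf-32-le"))

# ISO 8859-1 maps every byte to one character, so eight characters.
as_latin1 = code_points(data.decode("iso-8859-1"))

# Packed C strings: split on the NUL terminators; the last terminator
# leaves an empty remainder, which is dropped here.
as_c_strings = data.split(b"\x00")[:-1]

print(as_utf16le)
print(as_utf32le)
print(as_latin1)
print(as_c_strings)
```

Every one of these decodes succeeds, which is the point: the bytes alone do
not say which interpretation was intended.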
Received on Fri Jul 13 2012 - 18:19:58 CDT