Mark Davis <firstname.lastname@example.org> wrote:
> You can determine that that particular text is not legal UTF-32*,
> since there would be illegal code points in any of the three forms. IF you
> exclude null code points, again heuristically, that also excludes
> UTF-8, and almost all non-Unicode encodings. That leaves UTF-16,
> 16BE, 16LE as the only remaining possibilities. So look at those:
> 1. In UTF-16LE, the text is perfectly legal "Ken".
> 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
> Thus there are two legal interpretations of the text, if the only
> thing you know is that it is untagged. IF you have some additional
> information, such as that it could not be UTF-16LE, then you can
> limit it further.

OK, let me try to understand this again. I'm sorry, you guys should
know that I'm not just trying to be a gadfly, but despite my efforts I
am still confused over whether an unlabeled, BOM-free sequence may or
may not be treated as little-endian UTF-16.
I think what Mark is saying is that, given Ken's byte sequence:
0x4B 0x00 0x65 0x00 0x6E 0x00
and some reason (heuristics, knowledge of platform, divine guidance,
etc.) to believe that this is Unicode text represented in some flavor of
UTF-16, I have my choice of:
(a) treating it as either "UTF-16BE" or "UTF-16" and decoding it as
U+4B00 U+6500 U+6E00 ("䬀攀渀"), or
(b) treating it as "UTF-16LE" and decoding it as U+004B U+0065 U+006E
("Ken").
In case (b), though, I must not *call* the sequence "UTF-16," since that term is officially
reserved for BOM-marked text which can be either little- or big-endian,
or BOMless text which must be big-endian.
Is that what I have been missing all along? It's perfectly OK for the
text to be encoded and decoded this way, so long as nobody actually
calls it "UTF-16"? If so, then I've probably been arguing over nothing.
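
For anyone who wants to check the two readings of Ken's byte sequence, here is a minimal Python sketch. It assumes Python's built-in "utf-16-le" and "utf-16-be" codecs as stand-ins for the UTF-16LE and UTF-16BE encoding schemes; the variable names are purely illustrative and not from the thread.

```python
# Ken's six bytes, exactly as given in the message above.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

# Reading (b): little-endian, with no BOM expected or consumed.
as_le = data.decode("utf-16-le")

# Reading (a): big-endian, with no BOM expected or consumed.
as_be = data.decode("utf-16-be")

# Show the code points each reading produces.
le_points = [f"U+{ord(c):04X}" for c in as_le]
be_points = [f"U+{ord(c):04X}" for c in as_be]

print(as_le, le_points)
print(as_be, be_points)
```

Both decodes succeed without error, which matches the point that the untagged bytes have two legal interpretations and that only outside information can choose between them.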