Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Apr 24 2002 - 02:02:11 EDT

Previous message: Doug Ewell: "Re: How many printable characters in 3.2.0?"
In reply to: Mark Davis: "Re: "UNICODE BOMBER STRIKES AGAIN""
Next in thread: Mark Davis: "Re: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")"
Next in thread: Florian Weimer: "Re: "UNICODE BOMBER STRIKES AGAIN""
Reply: Mark Davis: "Re: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")"
Reply: jarkko.hietaniemi@nokia.com: "RE: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")"

> You can determine that that particular text is not legal UTF-32*,
> since there would be illegal code points in any of the three forms. IF you
> exclude null code points, again heuristically, that also excludes
> UTF-8, and almost all non-Unicode encodings. That leaves UTF-16,
> 16BE, 16LE as the only remaining possibilities. So look at those:
>
> 1. In UTF-16LE, the text is perfectly legal "Ken".
> 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
>
> Thus there are two legal interpretations of the text, if the only
> thing you know is that it is untagged. IF you have some additional
> information, such as that it could not be UTF-16LE, then you can
> limit it further.
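
To make that elimination concrete, here is a rough sketch in Python. The choice of language, the helper name try_decode, and the codec labels are mine and not anything from Mark's message. It simply tries each candidate encoding on the six bytes and reports whether the result is legal and whether it contains U+0000.

```python
# The six unlabeled, BOM-free bytes under discussion.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not legal in that encoding.

    try_decode is a hypothetical helper written for this illustration.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

for name in ("utf-32-le", "utf-32-be", "utf-8", "utf-16-le", "utf-16-be"):
    text = try_decode(data, name)
    if text is None:
        verdict = "not legal"
    elif "\x00" in text:
        # Legal, but ruled out by the "no null code points" heuristic.
        verdict = "legal, but contains U+0000"
    else:
        verdict = "legal"
    print(f"{name}: {verdict}")
```

Only the two UTF-16 readings come out both legal and free of nulls, which matches the two interpretations listed above.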

OK, let me try to understand this again. I'm sorry, you guys should
know that I'm not just trying to be a gadfly, but despite my efforts I
am still confused over whether an unlabeled, BOM-free sequence may or
may not be treated as little-endian UTF-16.

I think what Mark is saying is that, given Ken's byte sequence:

0x4B 0x00 0x65 0x00 0x6E 0x00

and some reason (heuristics, knowledge of platform, divine guidance,
etc.) to believe that this is Unicode text represented in some flavor of
UTF-16, I have my choice of:

(a) treating it as either "UTF-16BE" or "UTF-16" and decoding it as
U+4B00 U+6500 U+6E00 ("䬀攀渀"), or

(b) treating it as "UTF-16LE" and decoding it as U+004B U+0065 U+006E
("Ken"),

*BUT*

I must not *call* the sequence "UTF-16," since that term is officially
reserved for BOM-marked text which can be either little- or big-endian,
or BOMless text which must be big-endian.
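
For what it's worth, the two readings are easy to reproduce. The following is only a sketch, assuming Python and its codec names "utf-16-le" and "utf-16-be"; the code_points helper is something I made up for display purposes.

```python
# Ken's byte sequence, with no BOM and no label.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

def code_points(text):
    """Format each character of text in U+XXXX notation (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Reading (a): treat each pair of bytes as a big-endian code unit.
as_be = data.decode("utf-16-be")

# Reading (b): treat each pair of bytes as a little-endian code unit.
as_le = data.decode("utf-16-le")

print("(a)", code_points(as_be), as_be)  # (a) U+4B00 U+6500 U+6E00 䬀攀渀
print("(b)", code_points(as_le), as_le)  # (b) U+004B U+0065 U+006E Ken
```

I have deliberately used the explicit "-be" and "-le" codecs here. Python's plain "utf-16" codec, given input without a BOM, falls back to the platform's native byte order rather than to big-endian, so it would not illustrate the "UTF-16" label as described above.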

Is that what I have been missing all along? It's perfectly OK for the
text to be encoded and decoded this way, so long as nobody actually
calls it "UTF-16"? If so, then I've probably been arguing over nothing.

-Doug Ewell
Fullerton, California


This archive was generated by hypermail 2.1.2 : Wed Apr 24 2002 - 03:30:26 EDT