Re: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

From: Mark Davis (mark@macchiato.com)
Date: Wed Apr 24 2002 - 10:37:43 EDT


below
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Mark Davis" <mark@macchiato.com>; <unicode@unicode.org>
Cc: "Kenneth Whistler" <kenw@sybase.com>; <texin@progress.com>
Sent: Tuesday, April 23, 2002 23:02
Subject: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

> Mark Davis <mark@macchiato.com> wrote:
>
> > You can determine that that particular text is not legal UTF-32*,
> > since there would be illegal code points in any of the three forms.
> > IF you exclude null code points, again heuristically, that also
> > excludes UTF-8, and almost all non-Unicode encodings. That leaves
> > UTF-16, 16BE, 16LE as the only remaining possibilities. So look at
> > those:
> >
> > 1. In UTF-16LE, the text is perfectly legal "Ken".
> > 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
> >
> > Thus there are two legal interpretations of the text, if the only
> > thing you know is that it is untagged. IF you have some additional
> > information, such as that it could not be UTF-16LE, then you can
> > limit it further.
>
> OK, let me try to understand this again. I'm sorry, you guys should
> know that I'm not just trying to be a gadfly, but despite my efforts
> I am still confused over whether an unlabeled, BOM-free sequence may
> or may not be treated as little-endian UTF-16.
>
> I think what Mark is saying is that, given Ken's byte sequence:
>
> 0x4B 0x00 0x65 0x00 0x6E 0x00
>
> and some reason (heuristics, knowledge of platform, divine guidance,
> etc.) to believe that this is Unicode text represented in some
> flavor of UTF-16, I have my choice of:
>
> (a) treating it as either "UTF-16BE" or "UTF-16" and decoding it as
> U+4B00 U+6500 U+6E00 ("䬀攀渀"), or
>
> (b) treating it as "UTF-16LE" and decoding it as U+004B U+0065
> U+006E ("Ken"),
>
> *BUT*
>
> I must not *call* the sequence "UTF-16," since that term is
> officially reserved for BOM-marked text which can be either little-
> or big-endian, or BOMless text which must be big-endian.

Yes, assuming the "BUT" clause applies to (b). That is, the untagged
byte sequence

0x4B 0x00 0x65 0x00 0x6E 0x00

could be
(a) U+4B00 U+6500 U+6E00 ("䬀攀渀"): "UTF-16BE" or "UTF-16"
(b) U+004B U+0065 U+006E ("Ken"): "UTF-16LE"
(c) U+004B U+0000 U+0065 U+0000 U+006E U+0000
("K<null>e<null>n<null>"): ASCII, UTF-8, CP-1252, etc.
(d) ...: EBCDIC
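
Readings (a) through (c) can be checked mechanically. The following is a minimal sketch in Python (an illustrative choice, not something used in this thread); the codec names are Python's own, and the helper code_points is a hypothetical name introduced only for display.

```python
# The untagged byte sequence under discussion.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

# (a) big-endian reading: three CJK ideographs
as_be = data.decode("utf-16-be")

# (b) little-endian reading: "Ken"
as_le = data.decode("utf-16-le")

# (c) one byte per character: letters interleaved with nulls
as_ascii = data.decode("ascii")


def code_points(text):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, text in [("(a)", as_be), ("(b)", as_le), ("(c)", as_ascii)]:
    print(label, code_points(text))
```

All three decodes succeed, which is the point: nothing in the bytes themselves rules out any of these readings.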

If I really wanted to find out all the things it could be, I could run
it through the 700+ converters in ICU and capture all the cases that
don't detect illegal byte sequences. The vast majority of those
survivors are very unlikely, however, because they would produce nulls
in the code point sequence.
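
The same survey can be sketched with Python's bundled codecs standing in for the ICU converters. This is an assumption made for illustration only: Python ships far fewer converters than ICU, so the list it produces is not the list ICU would give. The function name candidate_decodings and its allow_nulls parameter are hypothetical.

```python
import encodings.aliases


def candidate_decodings(data, allow_nulls=False):
    """Map codec name -> decoded text for every codec that accepts data.

    With allow_nulls=False, decodings containing U+0000 are dropped,
    which is the heuristic filter described above.
    """
    results = {}
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            text = data.decode(name)
        except (ValueError, LookupError):
            # ValueError covers UnicodeDecodeError (illegal byte sequence);
            # LookupError covers codecs that are unavailable on this
            # platform or are not text encodings at all.
            continue
        if not allow_nulls and "\x00" in text:
            continue
        results[name] = text
    return results


data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])
results = candidate_decodings(data)
for name, text in results.items():
    print(name, repr(text))
```

Calling candidate_decodings(data, allow_nulls=True) instead shows how many single-byte code pages accept the bytes once the null filter is lifted.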

>
> Is that what I have been missing all along? It's perfectly OK for
> the text to be encoded and decoded this way, so long as nobody
> actually calls it "UTF-16"? If so, then I've probably been arguing
> over nothing.

Not really arguing, just exploring the issues. But one key is that if
you are in an environment where untagged data is being exchanged (a
bad idea, anyway), *and* the convention for that environment is to use
the BOM (in UTF-8, UTF-16, or UTF-32), thus excluding the possibility
of the explicit LE or BE forms, then that would further winnow down
the number of possible interpretations of untagged text. In this case,
that would select the (a) interpretation.
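
A minimal sketch of that convention in Python, assuming the rule stated earlier in this message (a BOM selects the byte order, and BOMless "UTF-16" is big-endian). The function name decode_utf16_scheme is hypothetical. Note that Python's built-in "utf-16" codec is not used here, because without a BOM it falls back to the platform's native byte order instead of big-endian.

```python
import codecs


def decode_utf16_scheme(data):
    """Decode bytes labeled "UTF-16": honor a BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    # No BOM: under this convention the text must be big-endian.
    return data.decode("utf-16-be")


untagged = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])
print(repr(decode_utf16_scheme(untagged)))                 # reading (a)
print(repr(decode_utf16_scheme(b"\xff\xfe" + untagged)))   # BOM says LE
```

With the BOM present the same six payload bytes decode as "Ken"; without it, the convention forces the (a) reading.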

One real problem we have is that the *encoding form* UTF-16 and the
*encoding scheme* UTF-16 are very different, but have the same name.
If we had an explicit name for one or the other that would help to
reduce the confusion. (We also don't have a name to distinguish the
BOMed UTF-8 from the unBOMed, but that seems to cause less confusion.)
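
The difference can be made concrete with a short Python sketch (an illustration only, using the standard struct module). The encoding form is a sequence of 16-bit code units with no byte order at all; an encoding scheme is what results once those units are serialized to bytes in a particular order.

```python
import struct

# Encoding form: "Ken" as three 16-bit code units. No bytes yet,
# so there is no endianness to talk about.
code_units = [0x004B, 0x0065, 0x006E]

# Encoding schemes: the same code units serialized to bytes.
as_utf16be = struct.pack(">3H", *code_units)   # big-endian byte order
as_utf16le = struct.pack("<3H", *code_units)   # little-endian byte order

print(as_utf16be.hex(" "))
print(as_utf16le.hex(" "))
```

The little-endian serialization is exactly the byte sequence 0x4B 0x00 0x65 0x00 0x6E 0x00 discussed above.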

>
> -Doug Ewell
> Fullerton, California
>
>
>
>


