RE: "UNICODE BOMBER STRIKES AGAIN"

From: Yves Arrouye (yves@realnames.com)
Date: Wed Apr 24 2002 - 13:39:10 EDT


> You can determine that that particular text is not legal UTF-32*,
> since there would be illegal code points in any of the three forms. IF you
> exclude null code points, again heuristically, that also excludes
> UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, 16BE,
> 16LE as the only remaining possibilities. So look at those:
>
> 1. In UTF-16LE, the text is perfectly legal "Ken".
> 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
>
> Thus there are two legal interpretations of the text, if the only
> thing you know is that it is untagged. IF you have some additional
> information, such as that it could not be UTF-16LE, then you can limit
> it further.

Actually, I also think that without any external information about the
encoding except that it is some UTF-16, it *has to* be interpreted as being
most significant byte first. I agree that it could be either UTF-16LE or
UTF-16BE/UTF-16, but in the absence of any other information, at this point
in time, it is governed by conformance clause C3 in section 3.1 of TUS 3.0,
and the reader has no choice but to declare it UTF-16, that is, big-endian.
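
To make the two readings concrete, here is a rough Python sketch. The byte
sequence is inferred from the quoted "Ken" example and is not given in the
original message; decode_untagged_utf16 is a hypothetical helper that applies
the "most significant byte first unless a BOM says otherwise" reading argued
for above. Note that Python's own "utf-16" codec falls back to the platform's
native byte order when there is no BOM, so the sketch names the byte order
explicitly.

```python
# Bytes inferred from the quoted example: "Ken" serialized as UTF-16LE.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

little = data.decode("utf-16-le")  # least significant byte first
big = data.decode("utf-16-be")     # most significant byte first


def decode_untagged_utf16(raw: bytes) -> str:
    """Hypothetical helper: honour a BOM if present, else assume big-endian."""
    if raw[:2] == b"\xff\xfe":
        return raw[2:].decode("utf-16-le")
    if raw[:2] == b"\xfe\xff":
        return raw[2:].decode("utf-16-be")
    # No BOM and no external information: most significant byte first.
    return raw.decode("utf-16-be")


# ascii() keeps the output printable on terminals without CJK support.
print(little, ascii(big), ascii(decode_untagged_utf16(data)))
```

Both decodes succeed, which is the point of the quoted message: legality
alone does not pick a byte order, so the default rule has to.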

Now what about auto-detection in relation to this conformance clause?
Readers that first try to be smart by auto-detecting encodings could of
course pick any of these as the 'auto-detected' one. Does that violate 3.1
C3's interpretation of bytes? I would say that as long as the auto-detector
is seen as a separate process/step, one can get away with it, since by the
time you look at the bytes to process the data, their encoding has been set
by the auto-detector.
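
A minimal sketch of that two-step arrangement, in Python. The names
detect_utf16_variant and read_text are made up for illustration, and the
zero-byte heuristic is only one crude possibility for an auto-detector, not
something the standard prescribes. The detector runs first and fixes the
encoding; the reader then interprets the bytes strictly according to that
label.

```python
import codecs


def detect_utf16_variant(raw: bytes) -> str:
    """Step 1 (hypothetical detector): guess a UTF-16 label for the bytes."""
    if raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # Python's "utf-16" codec reads and strips the BOM itself.
        return "utf-16"
    # Crude heuristic: Latin-range text has its zero bytes in the
    # high-order position, which is the odd offsets for little-endian.
    zeros_even = raw[0::2].count(0)
    zeros_odd = raw[1::2].count(0)
    if zeros_odd > zeros_even:
        return "utf-16-le"
    # Otherwise, including ties, fall back to most significant byte first.
    return "utf-16-be"


def read_text(raw: bytes, encoding: str) -> str:
    """Step 2: the encoding is now known; just interpret the bytes."""
    return raw.decode(encoding)


sample = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])
label = detect_utf16_variant(sample)
print(label, read_text(sample, label))
```

With this heuristic the sample bytes come out as little-endian, a choice the
reader alone could not have made without that earlier step.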

YA


