At 1999-11-02 01:52, Markus Kuhn wrote:
>Let's just represent malformed UTF-8 sequences by malformed UTF-16
>sequences (unpaired low surrogates).
I must say I don't understand why this is necessary. I wrote a UTF-8
decoder in Java that takes a stream of bytes (octets) and returns a
readable stream of chars (UCS2 codepoints). If it comes across an illegal
UTF-8 sequence, it throws an exception.
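
A minimal sketch of that fail-on-error behaviour, assuming the present-day
java.nio.charset API rather than the decoder described in this message (whose
code is not shown). The class name StrictUtf8 and the method decode are
hypothetical, chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the decoder from the message above.
final class StrictUtf8 {
    // Decode the whole input or fail: no replacement characters are
    // substituted and no partial result is returned.
    static String decode(byte[] input) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(input)).toString();
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // U+20AC
        byte[] bad = {'a', (byte) 0xC0, (byte) 0x80};          // overlong encoding of NUL
        try {
            System.out.println(decode(good).length());
            decode(bad); // throws: the leading 'a' is not handed back either
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

With CodingErrorAction.REPORT the caller gets either the complete string or an
exception, so the valid bytes that precede a bad sequence are never passed on.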
In writing it I took the position that bad input simply has no
interpretation whatever: if there's a bad sequence in the input, then
none of the input can be trusted and none of it should be used or passed
on to the user.
Unfortunately I also have to throw an exception for UCS4 codepoints not in
the first 17 planes since there's currently no way to represent these in
UCS2. Not that that's likely to be a problem, but it bothered me in
principle. Nevertheless one might also argue that that too is correct
behaviour for code that translates octet-streams to UCS2-streams.
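
The range check described above might look like the following sketch. It
assumes the decoder has already assembled a UCS4 value as an int and that
codepoints beyond the first plane are emitted as a surrogate pair of chars;
the message does not say how its decoder handles them. The names Ucs4ToChars
and toChars are hypothetical.

```java
// Hypothetical helper, not the decoder from the message above.
final class Ucs4ToChars {
    // Last codepoint of plane 16, the highest a surrogate pair can express.
    static final int MAX_CODE_POINT = 0x10FFFF;

    // Convert one UCS4 value to one or two Java chars, or refuse it.
    // Note: this sketch does not reject the surrogate range D800..DFFF,
    // which a strict UTF-8 decoder would also treat as illegal.
    static char[] toChars(int codePoint) {
        if (codePoint < 0 || codePoint > MAX_CODE_POINT) {
            throw new IllegalArgumentException(
                    "not representable as chars: 0x" + Integer.toHexString(codePoint));
        }
        if (codePoint < 0x10000) {
            return new char[] {(char) codePoint}; // first plane: a single char
        }
        int v = codePoint - 0x10000;              // 20 bits left to split
        char high = (char) (0xD800 + (v >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }
}
```

Older UTF-8 definitions allowed sequences that encode values up to 0x7FFFFFFF,
so a check of this kind is where such input would be refused.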
-- 
Ashley Yakeley, Seattle WA
Almost empty page: <http://semantic.org/>