Re: UTF-16 encoding of malformed UTF-8 sequences

From: Ashley Yakeley (ashley@semantic.org)
Date: Tue Nov 02 1999 - 05:25:01 EST


At 1999-11-02 01:52, Markus Kuhn wrote:

>Let's just represent malformed UTF-8 sequences by malformed UTF-16
>sequences (unpaired low surrogates).

I must say I don't understand why this is necessary. I wrote a UTF-8
decoder in Java that takes a stream of bytes (octets) and returns a
readable stream of chars (UCS2 codepoints). If it comes across an illegal
UTF-8 sequence, it throws an exception.
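
For concreteness, here is a rough sketch of the shape of such a decoder.
It is not my actual code; the class and method names are made up, and it
accepts the full 31-bit range of the original UTF-8 definition, leaving
the 17-plane check to the UCS2 stage sketched further down:

    import java.io.IOException;
    import java.io.InputStream;

    // Sketch of a strict UTF-8 decoder: reads octets and throws on any
    // malformed sequence. Illustrative only, not the original code.
    public final class StrictUtf8Decoder {
        private final InputStream in;

        public StrictUtf8Decoder(InputStream in) {
            this.in = in;
        }

        // Returns the next UCS4 codepoint, or -1 at end of stream.
        public int readCodePoint() throws IOException {
            int b0 = in.read();
            if (b0 < 0) return -1;                        // end of stream
            if (b0 < 0x80) return b0;                     // single octet (ASCII)

            int len, cp;
            if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
            else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
            else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
            else if ((b0 & 0xFC) == 0xF8) { len = 5; cp = b0 & 0x03; }
            else if ((b0 & 0xFE) == 0xFC) { len = 6; cp = b0 & 0x01; }
            else throw new IOException("malformed UTF-8: bad lead octet " + b0);

            for (int i = 1; i < len; i++) {
                int b = in.read();
                if (b < 0 || (b & 0xC0) != 0x80)
                    throw new IOException("malformed UTF-8: truncated or bad continuation octet");
                cp = (cp << 6) | (b & 0x3F);
            }

            // Reject overlong encodings: the value must actually need this many octets.
            int[] min = { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
            if (cp < min[len])
                throw new IOException("malformed UTF-8: overlong sequence");
            return cp;
        }
    }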

In writing it I took the position that bad input simply has no
interpretation whatever: if there's a bad sequence in the input, then
none of the input can be trusted and none of it should be used or passed
on to the user.

Unfortunately I also have to throw an exception for UCS4 codepoints not in
the first 17 planes, since there's currently no way to represent these in
UCS2. Not that that's likely to be a problem, but it bothered me in
principle. Nevertheless, one might also argue that that too is correct
behaviour for code that translates octet-streams to UCS2-streams.
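
The UCS2 stage looks roughly like this (again just a sketch, not my
actual code): a BMP codepoint becomes one char, a codepoint in planes
1 through 16 becomes a surrogate pair, and anything beyond that has
nowhere to go, so the only option at this layer is to signal an error:

    import java.io.IOException;

    // Sketch of the UCS2/UTF-16 stage of the decoder. Illustrative only.
    final class Ucs2Encoder {
        static char[] toUtf16(int cp) throws IOException {
            if (cp < 0x10000)
                return new char[] { (char) cp };          // BMP: one code unit
            if (cp <= 0x10FFFF) {
                int v = cp - 0x10000;
                return new char[] {
                    (char) (0xD800 | (v >>> 10)),         // high surrogate
                    (char) (0xDC00 | (v & 0x3FF))         // low surrogate
                };
            }
            throw new IOException("codepoint outside the first 17 planes: 0x"
                    + Integer.toHexString(cp));
        }
    }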

-- 
Ashley Yakeley, Seattle WA
Almost empty page: <http://semantic.org/>
