RE: UTF-16 encoding of malformed UTF-8 sequences

From: Marco.Cimarosti@icl.com
Date: Tue Nov 02 1999 - 10:52:37 EST


Markus Kuhn's approach, as seen from an end-user's perspective:
"I have a UTF-8 file that I can edit even with good old VI (however, the
Chinese characters in it look like garbage). When I load it with MarkusEdit
1.0, those Chinese characters still look like garbage. Something must be
wrong, could you please find the time to investigate this before next
release?"

Ashley Yakeley's approach, as seen from an end-user's perspective:
"I have a UTF-8 file that I can edit even with good old VI (however, the
Chinese characters in it look like garbage). When I tried to load it with
AshleyEdit 1.0, it broke with an exception. I don't have time to lose on
your bugs, so I won't use that program anymore!"

John Cowan's approach, as seen from an end-user's perspective:
"I had a UTF-8 file that I could edit even with good old VI (however, the
Chinese characters in it looked like garbage). When I loaded it with
JohnEdit 1.0, those Chinese characters still looked like garbage. But the
surprise came when I saved the file, and all my Chinese characters became
0x80! That caused me a million dollars in damage! So expect a letter from my
lawyer!"

Note: in all cases, the "UTF-8" text file was really a *GB* text file, so
the bug was actually in the end-user's head, as you will then try to explain
to him/her...
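
For anyone who wants to try the three behaviours on real bytes, here is a
rough sketch in Python. It is only an analogy, not anybody's actual
implementation: I am assuming that Python's standard "surrogateescape",
"strict" and "replace" error handlers are close enough to the three
positions above, and the sample text (two Chinese characters encoded as
GB2312) is invented for the illustration. In particular, "replace"
substitutes U+FFFD rather than the 0x80 of the story; it only stands in for
a lossy substitution in general.

```python
# Hypothetical sample: two Chinese characters (U+4E2D U+6587) stored as
# GB2312, which the user wrongly believes to be UTF-8.
gb_bytes = "\u4e2d\u6587".encode("gb2312")

# 1. Kuhn-like: keep every undecodable byte as an unpaired low surrogate.
#    Python's "surrogateescape" handler maps byte 0xNN to U+DCNN, and the
#    same handler turns it back into the original byte when encoding.
escaped = gb_bytes.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# 2. Yakeley-like: malformed input has no interpretation, so refuse it.
try:
    gb_bytes.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 3. Lossy substitution (assumed stand-in for the third scenario): each bad
#    sequence becomes U+FFFD, so the original bytes are gone after saving.
replaced = gb_bytes.decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8")

# ascii() keeps the output printable even with lone surrogates in it.
print("escaped:     ", ascii(escaped))
print("round trip ok:", round_trip == gb_bytes)
print("strict failed:", strict_failed)
print("replaced:    ", ascii(replaced))
print("bytes kept:   ", lossy == gb_bytes)
```

Only the first variant gives the original bytes back on save, and that is
what still lets the user reopen the file as GB once the misunderstanding is
cleared up.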

Ciao.
        Marco

> -----Original Message-----
> From: Ashley Yakeley [SMTP:ashley@semantic.org]
> Sent: 1999 November 02, Tuesday 11.25
> To: Unicode List
> Subject: Re: UTF-16 encoding of malformed UTF-8 sequences
>
> At 1999-11-02 01:52, Markus Kuhn wrote:
>
> >Let's just represent malformed UTF-8 sequences by malformed UTF-16
> >sequences (unpaired low surrogates).
>
> I must say I don't understand why this is necessary. I wrote a UTF-8
> decoder in Java that takes a stream of bytes (octets) and returns a
> readable stream of chars (UCS2 codepoints). If it comes across an illegal
> UTF-8 sequence, it throws an exception.
>
> In writing it I took the position that bad input simply has no
> interpretation whatever: if there's a bad sequence in the input, then
> none of the input can be trusted and none of it should be used or passed
> on to the user.
>
> Unfortunately I also have to throw an exception for UCS4 codepoints not in
> the first 17 planes since there's currently no way to represent these in
> UCS2. Not that that's likely to be a problem, but it bothered me in
> principle. Nevertheless one might also argue that that too is correct
> behaviour for code that translates octet-streams to UCS2-streams.
>
> --
> Ashley Yakeley, Seattle WA
> Almost empty page: <http://semantic.org/>


