Re: illegal UTF-8 sequences and mbtowc()

From: Henning Brunzel (hbrunzel@yahoo.com)
Date: Sat Oct 30 1999 - 08:37:22 EDT


Markus Kuhn wrote:
> c) Let's extend UTF-16 to provide an encoding of malformed UTF-8 sequences.
> For instance, we could define in Plane 14 255 bytes that represent
> bytes which were part of an illegal UTF-8 sequence. This would allow
> loss-less UTF-8 -> UTF-16 -> UTF-8 conversion even for arbitrary random
> byte-strings that do not look anything like valid UTF-8.
IIRC the starting point for this was to get only one code for every
malformed sequence instead of one for every byte. This proposal would
actually get two 16-bit codes per byte (a Plane 14 character is a
surrogate pair in UTF-16). Am I missing something?
> No
> information would be lost. Would there even be space in the BMP for
> this? (There was sufficient space for adding Braille after all!)
This would give one code per malformed byte again.
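
To make the one-code-per-byte idea concrete, here is a minimal sketch in
Python. It is not Markus's proposal as such: ESCAPE_BASE is a hypothetical
range picked from the Plane 15 private-use area purely for illustration, and
the handler name "escape_bytes" and both function names are made up. Note
that the round trip is only loss-less if the input never contains valid
UTF-8 for the escape code points themselves.

```python
import codecs

# Hypothetical escape range: byte 0xNN of a malformed sequence is mapped to
# ESCAPE_BASE + 0xNN. This is NOT a standardized assignment.
ESCAPE_BASE = 0xF0000


def _escape_bad_bytes(err):
    """Error handler: turn each undecodable byte into one escape code point."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), err.end


codecs.register_error("escape_bytes", _escape_bad_bytes)


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, keeping malformed bytes as escape code points."""
    return data.decode("utf-8", errors="escape_bytes")


def encode_lossless(text: str) -> bytes:
    """Encode to UTF-8, turning escape code points back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE + 0x80 <= cp <= ESCAPE_BASE + 0xFF:
            out.append(cp - ESCAPE_BASE)  # restore the original bad byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With this, something like encode_lossless(decode_lossless(b"ok \xff\xe2(")) 
should give back the original bytes, at the price of one extra code point
per malformed byte.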

What about
        d) U+FFFF. Someone once said this would be a kind of special Private
        Use Control Character. Of course this wouldn't formally be legal
        Unicode, but OTOH this would make clear that it's an error.
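
A rough sketch of what d) could look like, again in Python and again only
under assumptions: the handler name "mark_ffff" and the function name are
made up, and what counts as "one malformed sequence" is simply whatever span
Python's UTF-8 decoder reports per error, not something this thread has
agreed on. Unlike a per-byte escape, this is lossy: the original bytes
cannot be recovered.

```python
import codecs

ERROR_MARK = "\uffff"  # noncharacter used here as an in-band error marker


def _mark_with_ffff(err):
    """Error handler: replace each reported malformed span with one U+FFFF."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return ERROR_MARK, err.end


codecs.register_error("mark_ffff", _mark_with_ffff)


def decode_marking_errors(data: bytes) -> str:
    """Decode UTF-8, emitting U+FFFF wherever the input is malformed."""
    return data.decode("utf-8", errors="mark_ffff")
```

A caller can then test for ERROR_MARK in the result to see whether the
input was clean.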

But the main point seems to be: handling this differently from the rest
of the world makes it quite incompatible. So we could just as well roll
our own conversion functions and not care about standards at all, like
the Plan 9 people.



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT