Re: illegal UTF-8 sequences and mbtowc()

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sat Oct 30 1999 - 07:47:24 EDT


Henning Brunzel wrote on 1999-10-30 10:39 UTC:
> Markus Kuhn wrote:
> > c) Let's extend UTF-16 to provide an encoding of malformed UTF-8 sequences.
> > For instance, we could define 255 codes in Plane 14 that represent
> > bytes which were part of an illegal UTF-8 sequence. This would allow
> > loss-less UTF-8 -> UTF-16 -> UTF-8 conversion even for arbitrary random
> > byte-strings that do not look anything like valid UTF-8.
> IIRC the starting point for this was to get only one code for every
> malformed sequence instead of every byte. This proposal would actually
> get two codes per byte. Am I missing something?

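Think of a surrogate pair as a single code in UTF-16. A Plane 14
character lies outside the BMP, so UTF-16 writes it as two 16-bit code
units, but the pair still encodes exactly one character. The proposal
therefore yields one code per malformed byte, not two.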
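
To make the point concrete, here is a minimal Python sketch of how one
escaped byte would become one Plane 14 code point and hence one surrogate
pair. The base value ESCAPE_BASE and the helper names are hypothetical,
chosen only for illustration; the proposal quoted above names Plane 14 but
no specific range.

```python
# Hypothetical start of the escape range inside Plane 14 (U+E0000..U+EFFFF).
# The thread does not fix a value; this one is an assumption.
ESCAPE_BASE = 0xE0F00


def escape_byte(b):
    """Map one byte of a malformed UTF-8 sequence to a single code point."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    return ESCAPE_BASE + b


def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000                   # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits -> low surrogate
    return high, low


cp = escape_byte(0xC0)                 # 0xC0 is never valid in UTF-8
high, low = to_surrogate_pair(cp)
print("U+%05X -> %04X %04X" % (cp, high, low))
```

The two 16-bit units printed at the end are one surrogate pair, that is,
one code for the one malformed byte. Converting back to UTF-8 would only
need to subtract ESCAPE_BASE from the decoded code point to recover the
original byte.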

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


