Re: illegal UTF-8 sequences and mbtowc()

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sat Oct 30 1999 - 07:47:24 EDT

Next message: Constantine Stathopoulos: "Re: XML versions of Bible and Quran"
Previous message: Henning Brunzel: "Re: illegal UTF-8 sequences and mbtowc()"
Maybe in reply to: Markus Kuhn: "Re: illegal UTF-8 sequences and mbtowc()"
Next in thread: Henning Brunzel: "Re: illegal UTF-8 sequences and mbtowc()"

Henning Brunzel wrote on 1999-10-30 10:39 UTC:
> Markus Kuhn wrote:
> > c) Let's extend UTF-16 to provide an encoding of malformed UTF-8 sequences.
> > For instance, we could define in Plane 14 255 bytes that represent
> > bytes which were part of an illegal UTF-8 sequence. This would allow
> > loss-less UTF-8 -> UTF-16 -> UTF-8 conversion even for arbitrary random
> > byte-strings that do not look anything like valid UTF-8.
> IIRC the starting point for this was to get only one code for every
> malformed sequence instead of every byte. This proposal would actually
> get two codes per byte. Am I missing something?

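Think of a surrogate pair as a single code in UTF-16. Each Plane 14
character is one code point, even though UTF-16 stores it as two 16-bit
code units, so the proposal still gives one code per malformed byte.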
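
To make the counting concrete, here is a minimal sketch of the idea, not a
worked-out proposal. The base value HYPOTHETICAL_BASE and the helper names
are assumptions made up for illustration; the proposal only says "in
Plane 14" and no such range is actually assigned. The surrogate arithmetic
itself is the standard UTF-16 rule.

```python
# Hypothetical start of the escape range. The proposal only says
# "in Plane 14"; this offset is invented and is not an assigned range.
HYPOTHETICAL_BASE = 0xEFE00


def escape_byte(b):
    """Map one byte of a malformed UTF-8 sequence to one Plane 14 code point."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte: %r" % (b,))
    return HYPOTHETICAL_BASE + b


def unescape_byte(cp):
    """Recover the original byte from an escape code point."""
    if not HYPOTHETICAL_BASE <= cp <= HYPOTHETICAL_BASE + 0xFF:
        raise ValueError("not an escape code point: U+%04X" % cp)
    return cp - HYPOTHETICAL_BASE


def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: U+%04X" % cp)
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogate_pair(hi, lo):
    """Join a high and a low surrogate back into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


# One malformed byte -> one code point -> one surrogate pair -> same byte.
cp = escape_byte(0xC0)
hi, lo = to_surrogate_pair(cp)
assert unescape_byte(from_surrogate_pair(hi, lo)) == 0xC0
```

The pair (hi, lo) occupies two 16-bit units in storage, but it decodes to
exactly one code point, which in turn stands for exactly one byte. That is
why the byte-string survives the UTF-8 -> UTF-16 -> UTF-8 round trip.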

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT