RE: illegal UTF-8 sequences and mbtowc()

From: Marco.Cimarosti@icl.com
Date: Tue Nov 02 1999 - 10:52:51 EST


If I understand it, Markus' idea of using a surrogate pair is great! The egg
of Columbus, IMHO.

There is no way that a UTF-8 decoder could generate a surrogate pair from a
legal sequence representing a value in range U+0000 to U+00FF, so it is as
good an indicator of an error as anything else (U+FFFF, U+0080, etc.).

Yet, when you encode the surrogate pair back to UTF-8 it is converted to the
same single offending byte, thus preserving the original byte sequence --
whether or not it was good UTF-8.

This allows things that, theoretically, should not be allowed but are
practically needed in everyday life: such as loading (by mistake) a
non-UTF-8 text file in a program expecting UTF-8, and saving it back without
corrupting the original content.
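
A minimal sketch of that load-and-save round trip, in Python. The escape
block base (ESCAPE_BASE), the error-handler name "plane14escape", and the
two helper functions are assumptions made up for illustration; no such
Plane 14 block is actually defined. Each byte that cannot be decoded as
UTF-8 is mapped to one supplementary code point, which is a single
surrogate pair in UTF-16, and is mapped back to the same byte on output.

```python
import codecs

# Hypothetical base of a 256-code-point escape block in Plane 14.
# The value is an assumption for illustration only.
ESCAPE_BASE = 0xE0F00


def _escape_bad_byte(err):
    """Map the first undecodable byte to an escape code point."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start]
    # Resume right after this one byte so every bad byte gets its own code.
    return chr(ESCAPE_BASE + bad), err.start + 1


codecs.register_error("plane14escape", _escape_bad_byte)


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, turning each malformed byte into an escape code point."""
    return data.decode("utf-8", errors="plane14escape")


def encode_lossless(text: str) -> bytes:
    """Encode to UTF-8, turning escape code points back into raw bytes."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - ESCAPE_BASE
        if 0 <= offset <= 0xFF:
            out.append(offset)          # restore the original offending byte
        else:
            out += ch.encode("utf-8")   # ordinary character
    return bytes(out)


# A Latin-1 file loaded by mistake as UTF-8, then saved back unchanged.
latin1_data = "caf\u00e9 au lait".encode("latin-1")
assert encode_lossless(decode_lossless(latin1_data)) == latin1_data
```

One limitation of this sketch: input that already contains the legal UTF-8
encoding of an escape code point would not survive the round trip, so a
real decoder would have to treat those sequences as malformed too.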

Regards.
        Marco

> -----Original Message-----
> From: Markus Kuhn [SMTP:Markus.Kuhn@cl.cam.ac.uk]
> Sent: 1999 October 30, Saturday 13.45
> To: Unicode List
> Cc: Unicode List
> Subject: Re: illegal UTF-8 sequences and mbtowc()
>
> Henning Brunzel wrote on 1999-10-30 10:39 UTC:
> > Markus Kuhn wrote:
> > > c) Let's extend UTF-16 to provide an encoding of malformed UTF-8
> > > sequences. For instance, we could define in Plane 14 255 bytes that
> > > represent bytes which were part of an illegal UTF-8 sequence. This
> > > would allow loss-less UTF-8 -> UTF-16 -> UTF-8 conversion even for
> > > arbitrary random byte-strings that do not look anything like valid
> > > UTF-8.
> > IIRC the starting point for this was to get only one code for every
> > malformed sequence instead of every byte. This proposal would actually
> > get two codes per byte. Am I missing something?
>
> Think of a surrogate pair as a single code in UTF-16.
>
> Markus
>
> --
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>


