Re: illegal UTF-8 sequences and mbtowc()

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sat Oct 30 1999 - 05:20:34 EDT


Ienup Sung wrote on 1999-10-29 19:07 UTC:
> What if I want to collect illegal characters?

Then write your own UTF-8 decoder in the application, if you have such
special processing requirements (which most applications don't). It is
less than 20 lines after all. I am just concerned that the UTF-8 decoder
that comes with the C library should be especially convenient to use and
should lead to uniform results across applications. Not using the error
reporting facilities of mbtowc() seems to me to be the way to achieve
this. Plan9 did the same thing. Unicode has enough code points to allow
in-band error signalling, so we do not need the extra out-of-band error
condition (which is more complicated to handle) that mbtowc() etc.
provide. I don't care much which characters we use to represent a
malformed UTF-8 sequence in-band. Possible alternatives that I could
think of:

  a) U+FFFD (with some good will, that is what you can read into
     ISO 10646-1 section R.7, which in silly old ISO tradition is again
     written generically to the point of uselessness.)

  b) U+0080 (what the Plan9 authors chose before the ISO definition of UTF-8
     was written)

  c) Let's extend UTF-16 to provide an encoding of malformed UTF-8 sequences.
     For instance, we could define 255 code points in Plane 14 that represent
     bytes which were part of an illegal UTF-8 sequence. This would allow
     loss-less UTF-8 -> UTF-16 -> UTF-8 conversion even for arbitrary random
     byte-strings that do not look anything like valid UTF-8. No
     information would be lost. Would there even be space in the BMP for
     this? (There was sufficient space for adding Braille after all!)
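
A minimal sketch of what in-band signalling along the lines of
alternative (a) could look like in C. This is only an illustration under
stated assumptions: the names utf8_decode_inband(), utf8_decode_all()
and REPLACEMENT are hypothetical and not part of any C library, and the
validity rules used (at most four bytes, nothing above U+10FFFF, no
surrogates, no overlong forms) are an assumption of this sketch rather
than something fixed by the discussion above.

```c
#include <stddef.h>

/* Alternative (a): report malformed input in-band as U+FFFD.
 * Hypothetical constant, chosen for illustration only. */
#define REPLACEMENT 0xFFFDUL

/* Hypothetical helper: decode one code point from s[0..n-1], n >= 1.
 * It never fails: it always stores a value in *out and returns the
 * number of bytes consumed (at least 1). Malformed input yields
 * REPLACEMENT instead of an out-of-band error. */
static size_t utf8_decode_inband(const unsigned char *s, size_t n,
                                 unsigned long *out)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const unsigned long min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c = s[0];
    size_t len, i;

    if (c < 0x80) { *out = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else { *out = REPLACEMENT; return 1; }  /* stray continuation or bad lead byte */

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            *out = REPLACEMENT;  /* sequence cut short */
            return i;            /* resume decoding at the offending byte */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Overlong forms, surrogates and values beyond U+10FFFF are malformed too. */
    if (c < min_value[len] || c > 0x10FFFFUL || (c >= 0xD800 && c <= 0xDFFF)) {
        *out = REPLACEMENT;
        return len;
    }
    *out = c;
    return len;
}

/* Hypothetical convenience loop: decode a whole buffer.
 * dst must have room for n entries. Note that the caller needs no
 * error branch at all. Returns the number of code points written. */
static size_t utf8_decode_all(const unsigned char *s, size_t n,
                              unsigned long *dst)
{
    size_t count = 0;
    while (n > 0) {
        size_t used = utf8_decode_inband(s, n, &dst[count++]);
        s += used;
        n -= used;
    }
    return count;
}
```

With this shape, the byte string 41 E2 41 decodes to U+0041, U+FFFD,
U+0041. Alternative (b) would only change the REPLACEMENT constant, and
alternative (c) would instead map each offending byte to its own code
point so that the original bytes could be recovered.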

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


