Re: illegal UTF-8 sequences and mbtowc()

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Oct 29 1999 - 14:27:19 EDT


Markus Kuhn wrote on 1999-10-29 11:19 UTC:
> It is actually a shame that when C's mbtowc() discovers a malformed UTF-8
> sequence, it cannot signal back how long this bad sequence is. For instance,
> I find it nicer to treat a UTF-8 sequence with the last byte missing as a
> single malformed sequence, not as a sequence of unexpected bytes. This
> is also how I understood the ISO 10646-1 UTF-8 definition text and what
> xterm implements.
> <http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html>

I have done some reading in the ISO C standard and Amendment 1, and concluded
that the API (mbtowc, mbstowcs, mbrtowc, mbsrtowcs, etc.) does not
provide any facility with which the UTF-8 decoder can signal how long a
single malformed UTF-8 sequence in the sense of ISO 10646-1 section R.7
is. This means that a C program using mbtowc() or the like will always
have to treat a 4-byte UTF-8 sequence with the last byte missing just
like three separate 1-byte malformed sequences, as opposed to a single
malformed sequence.

The only choices that a C program has when it encounters a malformed
multi-byte sequence are the following:

  a) Advance the string pointer until the first valid character is decoded
     again. This would lead to any run of consecutive malformed sequences
     being treated as a single malformed sequence.

  b) Advance the string pointer by one. This would lead to every single
     byte in a malformed sequence being treated as a full malformed sequence.
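
To make the difference between (a) and (b) concrete, here is a rough
sketch in C. It does not call the real mbtowc(), because that would depend
on a UTF-8 locale being installed; strict_decode() is a hypothetical
stand-in that, like mbtowc(), only says "valid, n bytes" or "-1". The two
counting functions are likewise made up for illustration, and the stand-in
does not check for overlong forms.

```c
#include <stddef.h>

/* Hypothetical stand-in for mbtowc() in a UTF-8 locale: returns the length
 * of a well-formed sequence at s, or -1 for a malformed one, with no hint
 * about how many bytes the malformed sequence spans. Overlong forms are
 * not checked, to keep the sketch short. */
static int strict_decode(const unsigned char *s, size_t n)
{
    int need, i;

    if (n == 0) return -1;
    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) need = 2;
    else if ((s[0] & 0xF0) == 0xE0) need = 3;
    else if ((s[0] & 0xF8) == 0xF0) need = 4;
    else return -1;                 /* stray continuation byte or 0xF8..0xFF */
    if ((size_t)need > n) return -1;
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80) return -1;
    return need;
}

/* Strategy (a): after a failure, skip ahead until something decodes again.
 * A whole run of malformed bytes is counted as one error. */
size_t count_errors_resync(const unsigned char *s, size_t n)
{
    size_t i = 0, errors = 0;

    while (i < n) {
        int len = strict_decode(s + i, n - i);
        if (len > 0) { i += (size_t)len; continue; }
        errors++;
        do { i++; } while (i < n && strict_decode(s + i, n - i) < 0);
    }
    return errors;
}

/* Strategy (b): after a failure, advance by exactly one byte.
 * Every byte of a malformed sequence is counted as its own error. */
size_t count_errors_bytewise(const unsigned char *s, size_t n)
{
    size_t i = 0, errors = 0;

    while (i < n) {
        int len = strict_decode(s + i, n - i);
        if (len > 0) i += (size_t)len;
        else { errors++; i++; }
    }
    return errors;
}
```

Feeding both functions the bytes F0 9F 98 F0 9F 98 41 (two 4-byte
sequences, each with its last byte missing, followed by "A") gives one
error under (a) and six under (b). Neither yields the two malformed
sequences that a reader of ISO 10646-1 would expect.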

There is, however, a simple way out of this:

The C library could implement the mbtowc() UTF-8 decoder such that it
*NEVER* returns -1 to signal that it encountered a malformed sequence.
It could by convention treat every malformed (and overlong) UTF-8
sequence just like a valid encoding of the REPLACEMENT CHARACTER (U+FFFD).
I like the idea of having one error condition less to worry about, and it
would ensure that UTF-8 decoded wide character strings show exactly what
xterm also decodes. For a malformed UTF-8 sequence decoded as U+FFFD,
mbtowc() can return the length of the sequence and thereby jump over the
rest of a single malformed sequence, just like xterm does.
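
As a minimal sketch of what such a decoder might look like, here is a
standalone C function. The name utf8_decode_lenient(), its signature, and
the REPLACEMENT_CHAR macro are hypothetical and not part of any C library.
It is not a claim about how xterm or any libc is implemented. It also
assumes sequences of at most 4 bytes and additionally maps surrogates and
values above 0x10FFFF to U+FFFD, which goes beyond what is proposed above.

```c
#include <stddef.h>

#define REPLACEMENT_CHAR 0xFFFDUL

/* Hypothetical helper, not part of ISO C: decode one UTF-8 sequence from
 * s[0..n-1]. Returns the number of bytes consumed (0 only when n == 0)
 * and never reports an error; anything malformed comes back as U+FFFD
 * together with the length of the malformed sequence. */
size_t utf8_decode_lenient(const unsigned char *s, size_t n,
                           unsigned long *out)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const unsigned long min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long cp;
    size_t need, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }

    if (s[0] >= 0xC0 && s[0] <= 0xDF)      { need = 2; cp = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { need = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF7) { need = 4; cp = s[0] & 0x07; }
    else {
        /* Unexpected continuation byte, or 0xF8..0xFF (assumption:
         * sequences longer than 4 bytes are not supported here). */
        *out = REPLACEMENT_CHAR;
        return 1;
    }

    for (i = 1; i < need; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            /* Truncated: the lead byte plus the continuation bytes seen
             * so far form one single malformed sequence of length i. */
            *out = REPLACEMENT_CHAR;
            return i;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Overlong forms, surrogates, and out-of-range values are consumed
     * in full but reported as U+FFFD. */
    if (cp < min_value[need] || cp > 0x10FFFFUL ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        cp = REPLACEMENT_CHAR;
    *out = cp;
    return need;
}
```

With this sketch, a 4-byte sequence with the last byte missing (F0 9F 98)
followed by "A" decodes as one U+FFFD of length 3 and then "A", and the
overlong pair C0 AF decodes as one U+FFFD of length 2. The caller's loop
needs no error branch at all: it just adds the return value to the pointer.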

Highly robust handling of UTF-8 is much trickier than one might think at
first, but with a little forethought the authors of libraries can make it
quite easy.

What do you think?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


