Re: illegal UTF-8 sequences and mbtowc()

From: Markus Kuhn
Date: Fri Oct 29 1999 - 14:27:19 EDT

Markus Kuhn wrote on 1999-10-29 11:19 UTC:
> It is actually a shame that when C's mbtowc() discovers a malformed UTF-8
> sequence, it cannot signal back how long this bad sequence is. For instance,
> I find it nicer to treat a UTF-8 sequence with the last byte missing as a
> single malformed sequence, not as a sequence of unexpected bytes. This
> is also how I understood the ISO 10646-1 UTF-8 definition text and what
> xterm implements.

I have done some reading in the ISO C standard and Amendment 1, and
concluded that the API (mbtowc, mbstowcs, mbrtowc, mbsrtowcs, etc.)
provides no facility with which a UTF-8 decoder can signal how long a
single malformed UTF-8 sequence, in the sense of ISO 10646-1 section R.7,
is. This means that a C program using mbtowc() or the like will always
have to treat a 4-byte UTF-8 sequence with the last byte missing just
like three separate 1-byte malformed sequences, as opposed to a single
malformed sequence.

The only choices that a C program has when it encounters a malformed
multi-byte sequence are the following:

  a) Advance the string pointer until the first valid character is decoded
     again. This would lead to any run of adjacent malformed sequences
     being treated as a single malformed sequence.

  b) Advance the string pointer by one. This would lead to every single
     byte in a malformed sequence being treated as a full malformed
     sequence.
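
To make the difference between (a) and (b) concrete, here is a minimal
sketch. It does not call the real mbtowc(); strict_decode() is a
hypothetical stand-in that, like mbtowc(), only answers "length" or
"-1". It is limited to sequences of up to 4 bytes, which is a
simplifying assumption of this sketch. The two count_errors_*()
helpers are likewise made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mbtowc() in a UTF-8 locale: returns the number
   of bytes consumed for a well-formed sequence, or -1 for a malformed one.
   Like mbtowc(), it says nothing about how long the bad sequence is. */
static int strict_decode(unsigned long *out, const unsigned char *s, size_t n)
{
    assert(out != NULL && s != NULL && n > 0);
    unsigned long c = s[0];
    int len;

    if (c < 0x80) { *out = c; return 1; }
    else if (c >= 0xC2 && c <= 0xDF) { len = 2; c &= 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; c &= 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; c &= 0x07; }
    else return -1;                     /* stray continuation or bad lead byte */

    if ((size_t)len > n) return -1;     /* input ends inside the sequence */
    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;   /* continuation byte missing */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if ((len == 3 && c < 0x800) || (len == 4 && c < 0x10000) ||
        (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return -1;
    *out = c;
    return len;
}

/* Strategy (a): after a failure, skip bytes until something decodes again.
   A whole run of malformed input is counted as one error. */
static size_t count_errors_resync(const unsigned char *s, size_t n)
{
    size_t errors = 0, i = 0;
    unsigned long wc;

    while (i < n) {
        int r = strict_decode(&wc, s + i, n - i);
        if (r > 0) { i += (size_t)r; continue; }
        errors++;
        do { i++; } while (i < n && strict_decode(&wc, s + i, n - i) < 0);
    }
    return errors;
}

/* Strategy (b): after a failure, skip exactly one byte.
   Every byte of a malformed sequence is counted as its own error. */
static size_t count_errors_bytewise(const unsigned char *s, size_t n)
{
    size_t errors = 0, i = 0;
    unsigned long wc;

    while (i < n) {
        int r = strict_decode(&wc, s + i, n - i);
        if (r > 0) i += (size_t)r;
        else { errors++; i++; }
    }
    return errors;
}
```

For the bytes F0 9F 98 followed by 'A' (a 4-byte sequence with the last
byte missing), strategy (b) counts 3 errors and strategy (a) counts 1.
For two truncated sequences back to back, F0 9F 98 E2 82 'A', strategy
(b) counts 5 and strategy (a) still counts 1, whereas the reading of
ISO 10646-1 described above would count 2. Neither strategy can produce
that answer from a bare -1.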

There is, however, a simple way out of this:

The C library could implement the mbtowc() UTF-8 decoder such that it
*NEVER* returns -1 to signal that it encountered a malformed sequence.
It could by convention treat every malformed (and overlong) UTF-8
sequence just like a valid encoding of the REPLACEMENT CHARACTER
(U+FFFD). I like the idea of having one error condition less to worry
about, and it would ensure that UTF-8 decoded wide character strings show
exactly what xterm also decodes. For a malformed UTF-8 sequence decoded
as U+FFFD, mbtowc() can return the length of the sequence and thereby
jump over the rest of a single malformed sequence, just like xterm does.
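
A rough sketch of what such a decoder could look like follows. This is
not code from any existing C library or from xterm; lenient_decode() and
count_replacements() are hypothetical names, the sketch only handles
sequences of up to 4 bytes, and where exactly a malformed sequence is
taken to end (at the first byte that is not a continuation byte) is an
assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define REPLACEMENT_CHAR 0xFFFDUL

/* Hypothetical mbtowc()-like UTF-8 decoder that never reports an error.
   Returns the number of bytes consumed (at least 1 when n > 0, 0 when
   n == 0). A malformed or overlong sequence is decoded as U+FFFD, and the
   return value is the length of that one malformed sequence. */
static size_t lenient_decode(unsigned long *out, const unsigned char *s, size_t n)
{
    assert(out != NULL && s != NULL);
    if (n == 0) return 0;

    unsigned long c = s[0], min;
    size_t len, i;

    if (c < 0x80) { *out = c; return 1; }
    if (c >= 0xC0 && c <= 0xDF)      { len = 2; c &= 0x1F; min = 0x80; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; c &= 0x0F; min = 0x800; }
    else if (c >= 0xF0 && c <= 0xF7) { len = 4; c &= 0x07; min = 0x10000; }
    else {
        /* Stray continuation byte or a lead byte this sketch does not handle. */
        *out = REPLACEMENT_CHAR;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            /* Truncated: the lead byte plus the continuation bytes seen so
               far form ONE malformed sequence of length i. */
            *out = REPLACEMENT_CHAR;
            return i;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Overlong form, surrogate, or out of range: the whole sequence is
       treated as a single malformed sequence. */
    if (c < min || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        c = REPLACEMENT_CHAR;
    *out = c;
    return len;
}

/* Counts how many U+FFFD results a buffer decodes to. Note that this also
   counts a genuine, correctly encoded U+FFFD in the input. */
static size_t count_replacements(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;
    unsigned long wc;

    while (i < n) {
        i += lenient_decode(&wc, s + i, n - i);
        if (wc == REPLACEMENT_CHAR) count++;
    }
    return count;
}
```

With this convention the caller's loop has no error branch at all. The
truncated sequence F0 9F 98 followed by 'A' decodes to one U+FFFD with
a returned length of 3, and two truncated sequences back to back
(F0 9F 98 E2 82 'A') decode to two U+FFFD. The price is that the caller
can no longer tell a replaced malformed sequence from a real U+FFFD in
the input.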

Highly robust handling of UTF-8 is much trickier than one might think at
first, but with hindsight it can be made quite easy by the authors of

What do you think?


Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
