Re: illegal UTF-8 sequences and mbtowc()

From: Henning Brunzel (
Date: Sat Oct 30 1999 - 11:09:24 EDT

Henning Brunzel wrote:
> But the main point seems to be: handling this in a way other than the
> rest of the world makes it quite incompatible. So we could even roll our
> own conversion functions and not care about standards at all, like the
> Plan9 people.

After thinking about this twice, I think there is a way around this:
Have a global variable like __mbtowcs_replacement_character (or whatever
is posixly correct) which is by default set to something outside UCS-4,
say -1.
Have a macro like __MBTOWCS_HAVE_REPLACEMENT_CHARACTER (or whatever is
posixly correct) that programs can test to see whether the library
supports this.

Then we could do something like:
#ifdef __MBTOWCS_HAVE_REPLACEMENT_CHARACTER
__mbtowcs_replacement_character = 0xfffd;
do_your_conversion_without_care_about_all_this ();
#else
do_your_conversion_the_normal_way_with_too_many_errors ();
#endif

Additionally one could do something similar with a replacement offset
and get all illegal bytes mapped to Plane 14 or wherever you want them.

Of course when __mbtowcs_replacement_character is -1, the conversion
functions have to report all errors in the usual way.
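
To make the idea concrete, here is a rough sketch of how a decoder might
honour such a variable. Everything in it is an assumption for
illustration: the names sketch_mbtowc and mbtowcs_replacement_character
(without the reserved leading underscores) are made up, nothing like this
exists in glibc, and the decoder only handles sequences of up to four
bytes and does not check for surrogates.

```c
#include <stddef.h>

/* Hypothetical global from the proposal above.
 * -1 means "report illegal sequences as errors, as usual". */
long mbtowcs_replacement_character = -1;

/*
 * Decode one UTF-8 sequence from s (at most n bytes) into *out.
 * Returns the number of bytes consumed, or -1 on an illegal or
 * truncated sequence when no replacement character is set.
 */
int sketch_mbtowc(long *out, const unsigned char *s, size_t n)
{
    long c, min;
    int len, i;

    if (n == 0)
        return -1;
    if (s[0] < 0x80) {              /* plain ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0) {
        len = 2; c = s[0] & 0x1f; min = 0x80;
    } else if ((s[0] & 0xf0) == 0xe0) {
        len = 3; c = s[0] & 0x0f; min = 0x800;
    } else if ((s[0] & 0xf8) == 0xf0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        goto illegal;               /* stray continuation or bad lead byte */
    }
    if ((size_t)len > n)
        goto illegal;               /* truncated input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            goto illegal;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min)
        goto illegal;               /* overlong encoding */
    *out = c;
    return len;

illegal:
    if (mbtowcs_replacement_character == -1)
        return -1;                  /* usual error reporting */
    *out = mbtowcs_replacement_character;
    return 1;                       /* skip just the offending byte */
}
```

With the variable left at -1 the overlong sequence C0 80 is rejected;
after setting it to 0xfffd the same call yields U+FFFD and consumes one
byte, so the caller never sees an error.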

This is perhaps the best one can do for now, but it still won't give
you complete binary round-trip compatibility. Whatever you choose as
the replacement for your bytes might already be encoded as a valid UTF-8
sequence in the input. You could only put your replacement offset after
the end of UCS-4 (is this U-80000000 ?), but then again you would need
special wctomb routines that can handle this. And even that only covers
the UTF-8 -> int -> UTF-8 conversion. With UTF-16, I'm afraid you're
lost, because there isn't a single UTF-16 code that isn't also
representable in UTF-8.
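
The round-trip problem can be shown with a small sketch. It assumes a
replacement offset of 0xE0000 (Plane 14), so that an illegal byte b is
decoded as 0xE0000 + b. The names SKETCH_OFFSET, sketch_wctomb and
sketch_wctomb_unescape are made up for illustration, and the plain
encoder only goes up to U+10FFFF.

```c
#include <stddef.h>

#define SKETCH_OFFSET 0xE0000L   /* hypothetical offset: Plane 14 */

/* Plain UTF-8 encoder; returns the byte count, or -1 if out of range. */
int sketch_wctomb(unsigned char *buf, long c)
{
    if (c < 0 || c > 0x10ffffL)
        return -1;
    if (c < 0x80) {
        buf[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = (unsigned char)(0xc0 | (c >> 6));
        buf[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c < 0x10000L) {
        buf[0] = (unsigned char)(0xe0 | (c >> 12));
        buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        buf[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    }
    buf[0] = (unsigned char)(0xf0 | (c >> 18));
    buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
    buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
    buf[3] = (unsigned char)(0x80 | (c & 0x3f));
    return 4;
}

/* "Special" encoder: turn escaped bytes back into the raw byte. */
int sketch_wctomb_unescape(unsigned char *buf, long c)
{
    if (c >= SKETCH_OFFSET + 0x80 && c <= SKETCH_OFFSET + 0xff) {
        buf[0] = (unsigned char)(c - SKETCH_OFFSET);
        return 1;
    }
    return sketch_wctomb(buf, c);
}
```

An illegal input byte FF becomes 0xE00FF and the special encoder writes
FF again, which is what we want. But the plain encoding of 0xE00FF is the
valid sequence F3 A0 83 BF, so an input that really contained those four
bytes decodes to the same value and comes back out as the single byte FF.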

Will someone code this for glibc?

