Re: illegal UTF-8 sequences and mbtowc()

From: John Cowan (cowan@ccil.org)
Date: Fri Oct 29 1999 - 14:48:35 EDT


Markus Kuhn wrote:

> There is however a simple way out of this:
>
> The C library could implement the mbtowc() UTF-8 decoder, such that it
> *NEVER* returns -1 to signal that it encountered a malformed sequence.
> It could by convention just treat every malformed (and overlong) UTF-8
> sequence just like a valid encoding of the REPLACEMENT CHARACTER.

This is almost exactly what the Plan 9 implementation does, except that it uses
a different character, on the grounds that an encoding error is not the same as
an unrepresentable character (the higher-level recovery strategy, if any,
is different). The implementers' specific choice was the (basically)
unused control character U+0080.
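
For concreteness, here is a minimal sketch of a decoder in that spirit. It is
not the Plan 9 source and not any libc's mbtowc(); the name lenient_decode, its
signature, and the DECODE_ERROR_CHAR macro are made up for illustration. It
assumes the modern four-byte limit on UTF-8 and consumes exactly one byte per
malformed sequence, which is only one of several possible resynchronization
policies.

```c
#include <stddef.h>

/* Substitute stored for malformed input (illustrative macro name).
 * 0x0080 is the Plan 9 choice described above; the quoted proposal
 * would use 0xFFFD, the REPLACEMENT CHARACTER, instead. */
#define DECODE_ERROR_CHAR 0x0080UL

/* Hypothetical helper, not a real libc or Plan 9 function.
 * Decodes one character from s[0..n-1] into *out and returns the
 * number of bytes consumed. Returns 0 only when n == 0; it never
 * returns -1, so callers have no error path to handle. */
size_t lenient_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes;
     * anything below it in a sequence of that length is overlong. */
    static const unsigned long min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    unsigned long c;
    size_t len, i;

    if (n == 0)
        return 0;

    c = s[0];
    if (c < 0x80) {                      /* plain ASCII */
        *out = c;
        return 1;
    } else if ((c & 0xE0) == 0xC0) {
        len = 2; c &= 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
        len = 3; c &= 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
        len = 4; c &= 0x07;
    } else {
        goto bad;                        /* stray continuation or invalid lead byte */
    }

    if (len > n)
        goto bad;                        /* sequence truncated by end of input */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto bad;                    /* expected a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min_value[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        goto bad;

    *out = c;
    return len;

bad:
    *out = DECODE_ERROR_CHAR;
    return 1;
}
```

A caller can loop over a buffer with this and always make progress. Note that
with 0x0080 as the substitute, a correctly encoded U+0080 (bytes C2 80) and a
decoding error yield the same value, so the choice only works if that
character really is unused in practice.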

-- 

John Cowan   http://www.reutershealth.com   jcowan@reutershealth.com
Schlingt dreifach einen Kreis vom dies / Schliess eurer Aug vor heiliger Schau
Den er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
-- Coleridge (tr. Politzer)


