Re: illegal UTF-8 sequences and mbtowc()

From: Martin J. Duerst
Date: Wed Dec 08 1999 - 15:48:58 EST

I'm late to reply to this, but I think it is a very
dangerous proposal. It has a well-known acronym:
GIGO (garbage in, garbage out). The more data is
exchanged between all kinds of components of the
Internet and Web infrastructure without human intervention,
the higher the danger that it will be impossible
to figure out where the data came from, what it
was supposed to be, and where the error happened.

Therefore, early error detection is very important!
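
To make the contrast concrete, here is a minimal sketch of the two
behaviours under discussion. It is not the code of any C library:
the function names and the UTF8_REPLACEMENT constant are hypothetical,
and it assumes the present-day definition of UTF-8 (at most four bytes,
no surrogates, nothing above U+10FFFF). The strict decoder reports
malformed and overlong input to its caller, while the lenient one
substitutes a replacement character, as in the quoted proposal below.

```c
#include <stddef.h>

/* Hypothetical constant: the character substituted by the lenient decoder. */
#define UTF8_REPLACEMENT 0xFFFDUL

/*
 * Strict decoder (hypothetical name). Decodes one character from the
 * first n bytes of s. Returns the number of bytes consumed (1..4) and
 * stores the code point in *out, or returns -1 if the input is empty,
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF.
 */
int utf8_decode_strict(const unsigned char *s, size_t n, unsigned long *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long cp;
    int len, i;

    if (n == 0)
        return -1;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx */
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx */
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx */
        len = 4; cp = s[0] & 0x07;
    } else {
        return -1;                          /* stray continuation or invalid lead byte */
    }

    if ((size_t)len > n)
        return -1;                          /* sequence is cut short */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;                      /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_cp[len])
        return -1;                          /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;                          /* surrogate code point */
    if (cp > 0x10FFFF)
        return -1;                          /* beyond the Unicode range */

    *out = cp;
    return len;
}

/*
 * Lenient decoder (hypothetical name). Never reports an error: on bad
 * input it consumes one byte and yields UTF8_REPLACEMENT, so the caller
 * cannot tell that anything went wrong. Returns 0 only for empty input.
 */
int utf8_decode_lenient(const unsigned char *s, size_t n, unsigned long *out)
{
    int r;

    if (n == 0)
        return 0;
    r = utf8_decode_strict(s, n, out);
    if (r < 0) {
        *out = UTF8_REPLACEMENT;
        return 1;
    }
    return r;
}
```

Given the overlong two-byte sequence C0 80, the strict function returns
-1 at the point of failure, whereas the lenient one returns a character
and moves on. Changing the constant to 0x80 would give the substitution
that the quoted message attributes to Plan 9.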

Regards, Martin.

At 11:48 1999/10/29 -0700, John Cowan wrote:
> Markus Kuhn wrote:
> > There is however a simple way out of this:
> >
> > The C library could implement the mbtowc() UTF-8 decoder, such that it
> > *NEVER* returns -1 to signal that it encountered a malformed sequence.
> > It could by convention just treat every malformed (and overlong) UTF-8
> > sequence just like a valid encoding of the REPLACEMENT CHARACTER.
> This is almost exactly what the Plan 9 implementation does, except that it uses
> a different character, on the grounds that an encoding error is not the same as
> an unrepresentable character (the higher-level recovery strategy, if any,
> is different). The implementers' specific choice was the (basically)
> unused control character U+0080.
> --
> John Cowan
> Schlingt dreifach einen Kreis vom dies / Schliess eurer Aug vor heiliger Schau
> Den er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
> -- Coleridge (tr. Politzer)

#-#-# Martin J. Du"rst, World Wide Web Consortium
