Re: illegal UTF-8 sequences and mbtowc()

From: John Cowan (cowan@ccil.org)
Date: Fri Oct 29 1999 - 14:48:35 EDT

Next message: Marion Gunn: "Re: arabic number in bidi algorithm"
Previous message: Markus Kuhn: "Re: illegal UTF-8 sequences and mbtowc()"
Maybe in reply to: Markus Kuhn: "Re: illegal UTF-8 sequences and mbtowc()"
Next in thread: Markus Kuhn: "Re: illegal UTF-8 sequences and mbtowc()"

Markus Kuhn wrote:

> There is however a simple way out of this:
>
> The C library could implement the mbtowc() UTF-8 decoder, such that it
> *NEVER* returns -1 to signal that it encountered a malformed sequence.
> It could by convention just treat every malformed (and overlong) UTF-8
> sequence just like a valid encoding of the REPLACEMENT CHARACTER.

This is almost exactly what the Plan 9 implementation does, except that it uses
a different character, on the grounds that an encoding error is not the same as
an unrepresentable character (the higher-level recovery strategy, if any,
is different). The implementers' specific choice was the (basically)
unused control character U+0080.
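
A minimal sketch of that kind of decoder, for illustration only: this is not the Plan 9 source, and the names utf8_decode_lenient and DECODE_ERROR are made up here. It assumes the modern UTF-8 range (at most 4 bytes, up to U+10FFFF, no surrogates) and takes an explicit length rather than following the mbtowc() signature. How many bytes to skip after an error is also an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. The value U+0080 is the choice reported above. */
#define DECODE_ERROR 0x0080u

/*
 * Decode one character from s[0..n-1] (requires n >= 1) into *out.
 * Returns the number of bytes consumed: always at least 1, never -1.
 * Malformed, truncated, and overlong sequences all yield DECODE_ERROR.
 */
size_t utf8_decode_lenient(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned char b = s[0];
    size_t len, i;
    uint32_t c;

    if (b < 0x80) {                 /* plain ASCII */
        *out = b;
        return 1;
    } else if ((b & 0xE0) == 0xC0) {
        len = 2; c = b & 0x1F;
    } else if ((b & 0xF0) == 0xE0) {
        len = 3; c = b & 0x0F;
    } else if ((b & 0xF8) == 0xF0) {
        len = 4; c = b & 0x07;
    } else {                        /* stray continuation or invalid lead byte */
        *out = DECODE_ERROR;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            /* Truncated or broken: skip only the bytes already accepted. */
            *out = DECODE_ERROR;
            return i;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Overlong forms, surrogates, and out-of-range values are errors too. */
    if (c < min_value[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *out = DECODE_ERROR;
        return len;
    }

    *out = c;
    return len;
}
```

With this sketch the overlong pair C0 80 decodes to U+0080 and consumes two bytes, and so does the well-formed pair C2 80. The caller cannot tell the two apart, which is presumably why a (basically) unused character was picked as the marker.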

John Cowan   http://www.reutershealth.com   jcowan@reutershealth.com
Schlingt dreifach einen Kreis vom dies / Schliess eurer Aug vor heiliger Schau
Den er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
        -- Coleridge (tr. Politzer)

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT