From: Markus Kuhn
Date: Tue Apr 13 1999 - 04:22:30 EDT

John Cowan wrote on 1999-04-11 23:54 UTC:
> > If I implement a UTF-8 -> UCS-2 converter, what shall I do with
> > malformed UTF-8 sequences? ISO 10646-1 in section 2.3c and section R.7
> > clearly requires that malformed UTF-8 sequences are indicated to the
> > user. Is replacing any malformed UTF-8 sequence by 0xFFFD appropriate
> > use of this character? After all, a malformed UTF-8 sequence is in a
> > sense something outside the range of Unicode.
> The Plan 9 folks decided no, that an unknown character is not the same as
> an invalid encoding which does not represent any character.
> They map the latter into U+0080, an unused control character.

U+0080 seems a very random pick to me and will show up on Windows as the
euro sign. If they wanted to use an 8-bit control character, then a
better choice would have been U+001A (ASCII SUB), because according to
ISO 6429 and ECMA 35, section

  "SUB is used in the place of a character that has been found
  to be invalid or in error. SUB is intended to be introduced by
  automatic means."

It is just not clear to me whether I should introduce a new glyph to be
associated with a C0 control character.
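
To make the two options concrete, here is a minimal sketch in Python of a
decoder that marks each malformed UTF-8 sequence with either U+FFFD or
U+001A (SUB). It is an illustration only, not the Plan 9 approach or
anything required by ISO 10646-1. The names decode_utf8, _sub_handler and
the error-handler name "sub" are made up for this example; the sketch
relies on Python's standard codecs.register_error hook and produces a
Python string rather than a UCS-2 buffer.

```python
import codecs

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER
SUB = "\u001a"          # U+001A, ASCII SUB


def _sub_handler(exc):
    """Hypothetical error handler: emit SUB for each malformed sequence."""
    if isinstance(exc, UnicodeDecodeError):
        # Resume decoding right after the bytes the codec rejected.
        return (SUB, exc.end)
    raise exc


# Register the handler under an assumed name, "sub".
codecs.register_error("sub", _sub_handler)


def decode_utf8(data: bytes, marker: str = REPLACEMENT) -> str:
    """Decode UTF-8, marking malformed input with U+FFFD or SUB."""
    if marker == REPLACEMENT:
        # Built-in behaviour: malformed input becomes U+FFFD.
        return data.decode("utf-8", errors="replace")
    if marker == SUB:
        return data.decode("utf-8", errors="sub")
    raise ValueError("marker must be U+FFFD or U+001A")


if __name__ == "__main__":
    bad = b"a\xffb"  # 0xFF can never appear in well-formed UTF-8
    print(repr(decode_utf8(bad)))       # marked with U+FFFD
    print(repr(decode_utf8(bad, SUB)))  # marked with SUB
```

Either way the malformed input stays visible in the output instead of
being dropped silently; the open question is only which marker the user
ends up seeing.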


Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at,  WWW: <>
