Re: Purpose of REPLACEMENT CHARACTER

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Apr 13 1999 - 04:22:30 EDT


John Cowan wrote on 1999-04-11 23:54 UTC:
> > If I implement a UTF-8 -> UCS-2 converter, what shall I do with
> > malformed UTF-8 sequences? ISO 10646-1 in section 2.3c and section R.7
> > clearly requires that malformed UTF-8 sequences are indicated to the
> > user. Is replacing any malformed UTF-8 sequence by 0xFFFD appropriate
> > use of this character? After all, a malformed UTF-8 sequence is in a
> > sense something outside the range of Unicode.
>
> The Plan 9 folks decided no, that an unknown character is not the same as
> an invalid encoding which does not represent any character.
> They map the latter into U+0080, an unused control character.

U+0080 seems a very random pick to me, and on Windows it will show up as
the euro sign (byte 0x80 in code page 1252). If they wanted to use a
control character, then a better choice would have been U+001A (ASCII
SUB), because according to
ISO 6429 and ECMA-48 <ftp://ftp.ecma.ch/ECMA-ST/E035-PDF.PDF>, section
8.3.148:

  "SUB is used in the place of a character that has been found
  to be invalid or in error. SUB is intended to be introduced by
  automatic means."

It is just not clear to me whether I should introduce a new glyph to be
associated with a C0 control character.
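
To make the two options concrete, here is a minimal sketch in Python of
a decoder that substitutes either U+FFFD or U+001A (SUB) for malformed
UTF-8. It is an illustration under assumptions, not a reference
implementation: the names decode_utf8, use_sub and "substitute_sub" are
made up for this example, the result is a Python str rather than a UCS-2
buffer, and it relies on Python's standard codecs.register_error hook
and the built-in "replace" error handler.

```python
import codecs

SUB = "\u001a"          # ASCII SUB, the ISO 6429 substitute character
REPLACEMENT = "\ufffd"  # Unicode REPLACEMENT CHARACTER


def _substitute_with_sub(error):
    # Hypothetical handler: emit one SUB for each malformed span the
    # decoder reports, then resume decoding right after that span.
    if isinstance(error, UnicodeDecodeError):
        return (SUB, error.end)
    raise error


# "substitute_sub" is a name chosen for this sketch only.
codecs.register_error("substitute_sub", _substitute_with_sub)


def decode_utf8(data: bytes, use_sub: bool = False) -> str:
    # Default: the built-in "replace" handler, which inserts U+FFFD.
    # With use_sub=True: the handler above, which inserts U+001A.
    handler = "substitute_sub" if use_sub else "replace"
    return data.decode("utf-8", errors=handler)
```

For example, decode_utf8(b"abc\xffdef") marks the stray byte 0xFF with
U+FFFD, while decode_utf8(b"abc\xffdef", use_sub=True) marks it with
U+001A. Either way the malformed input is indicated rather than silently
dropped; which character is the more appropriate marker is exactly the
open question above.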

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


