Unused code positions and mapping to Unicode

From: Randy Williams (sasrsw@wnt.sas.com)
Date: Thu Aug 05 1999 - 19:42:08 EDT


What is the proper mapping to Unicode of unused characters in a legacy encoding?

For example, in the Windows Cp1252 (Latin-1) encoding, take code position 0x81:

 - It appears that Notepad will map this to U+0081.
 - Nadine's book shows it mapped to U+FFFE.
 - Java seems to map it to U+FFFD.
 - The mapping tables on the FTP site have it listed as undefined and
   don't give a Unicode value.
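
To make the candidate behaviours above concrete, here is a minimal sketch in
Python. Python's built-in "cp1252" codec is used only as a stand-in and is not
one of the tools named above; the helper name decode_options and the
UNUSED_CP1252 table are hypothetical, introduced for illustration.

```python
# The five byte values that Python's cp1252 codec treats as undefined.
UNUSED_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)

def decode_options(byte_value):
    """Show three ways a converter might treat a single cp1252 byte."""
    raw = bytes([byte_value])
    options = {}
    # Pass the byte through unchanged: 0x81 becomes the C1 control U+0081.
    options["passthrough"] = raw.decode("latin-1")
    # Substitute the replacement character U+FFFD for anything undefined.
    options["replacement"] = raw.decode("cp1252", errors="replace")
    # Strict conversion: treat the byte as undefined and refuse to map it.
    try:
        raw.decode("cp1252")
        options["strict"] = "decoded"
    except UnicodeDecodeError:
        options["strict"] = "undefined"
    return options

for b in UNUSED_CP1252:
    print(hex(b), decode_options(b))
```

The pass-through and U+FFFD results match two of the behaviours listed above,
and the strict result corresponds to a table that gives no Unicode value at
all.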

I would think that U+FFFD is right. But if you do that, round-trip
conversion will not work when the legacy encoding has multiple unused code
positions (and Cp1252 and others do have several). Doing what Notepad did
solves that, but it seems wrong, since the unused code position is not that
character in Unicode. I guess you could use the private use area to map them
to unique positions, but that does not seem right either. And if I did use
the private use area, other applications would likely not handle it properly
when the data is sent to them. And then what do I do when the vendor later
defines that code position, particularly when the vendor decides not to give
the encoding a new name (as happened in some cases when the Euro character
was added)? I won't be able to tell whether the data uses an unused code
position or the newly defined character.
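
The round-trip trade-off described above can be sketched as follows, again in
Python as a stand-in. The base value PUA_BASE = 0xE000 and the function names
decode_pua and encode_pua are assumptions made up for this illustration; they
are not a convention that any vendor or standard prescribes.

```python
# Hypothetical private-use base: unused byte b maps to U+E000 + b.
PUA_BASE = 0xE000
# Bytes that Python's cp1252 codec treats as undefined.
UNUSED = (0x81, 0x8D, 0x8F, 0x90, 0x9D)

def decode_pua(data):
    """Decode cp1252 bytes, giving each unused byte its own PUA code point."""
    out = []
    for b in data:
        if b in UNUSED:
            out.append(chr(PUA_BASE + b))
        else:
            out.append(bytes([b]).decode("cp1252"))
    return "".join(out)

def encode_pua(text):
    """Inverse of decode_pua: turn the PUA code points back into bytes."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if offset in UNUSED:
            out.append(offset)
        else:
            out += ch.encode("cp1252")
    return bytes(out)

data = b"a\x81b\x8d"
# U+FFFD collapses both unused bytes into one character, so the original
# byte values cannot be recovered.
lossy = data.decode("cp1252", errors="replace")
# The private-use mapping keeps them distinct and round-trips exactly.
assert encode_pua(decode_pua(data)) == data
```

Note that the UNUSED table is fixed at the time the converter is written, so
this sketch does nothing to address the problem of a vendor later assigning a
character to one of those positions.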

What are others doing to handle this? An answer that the data should never
contain those code positions, while an understandable argument, is not helpful.

Thanks in advance.


                         Randolph S. Williams
National Language Support Voice: 919.677.8000
SAS Institute Inc. Fax: 919.677.4444
Cary, NC 27513 USA Email: Randy.Williams@sas.com
