RE: Unused code positions and mapping to Unicode

From: Murray Sargent (murrays@microsoft.com)
Date: Thu Aug 05 1999 - 19:55:09 EDT


Windows MultiByteToWideChar() maps unused 0x80 - 0x9F codes to themselves in
order to roundtrip them. In general these codes should be mapped to
themselves since they belong to the C1 control characters, which Unicode
respects. Code pages 125x assign some of these codes, so they don't remain
the same when mapped to Unicode. In any event, they all roundtrip with
respect to a given 125x codepage.
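
As a rough illustration of the tradeoff (a sketch in Python, not the Windows implementation), the snippet below compares replacing undefined bytes with U+FFFD against mapping them to the same-numbered C1 control. It assumes Python's own "cp1252" codec table, which leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined; the helper names undefined_bytes, decode_c1_fallback and encode_c1_fallback are made up for this example.

```python
def undefined_bytes(codec: str) -> set:
    """Return the bytes in 0x80-0x9F that the given codec refuses to decode."""
    result = set()
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            result.add(b)
    return result


def decode_c1_fallback(data: bytes, codec: str = "cp1252") -> str:
    """Decode byte by byte; undefined bytes become the same-numbered C1 control."""
    holes = undefined_bytes(codec)
    out = []
    for b in data:
        if b in holes:
            out.append(chr(b))  # e.g. 0x81 -> U+0081
        else:
            out.append(bytes([b]).decode(codec))
    return "".join(out)


def encode_c1_fallback(text: str, codec: str = "cp1252") -> bytes:
    """Inverse of decode_c1_fallback: C1 controls go back to the undefined bytes."""
    holes = undefined_bytes(codec)
    out = bytearray()
    for ch in text:
        if ord(ch) in holes:
            out.append(ord(ch))  # only positions the codec leaves undefined
        else:
            out += ch.encode(codec)  # raises for anything else unmappable
    return bytes(out)


data = bytes([0x41, 0x81, 0x8D, 0x80])

# U+FFFD substitution: both undefined bytes collapse to one character,
# so the original bytes cannot be recovered.
lossy = data.decode("cp1252", errors="replace")

# C1 fallback: each undefined byte keeps a distinct code point and roundtrips.
text = decode_c1_fallback(data)
restored = encode_c1_fallback(text)
```

Here lossy holds two identical U+FFFD characters, while restored equals data. The fallback is deliberately limited to positions the codec leaves undefined, so an assigned byte such as 0x80 (the Euro sign in cp1252) still maps to its real Unicode character.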

Murray

> -----Original Message-----
> From: Randy Williams [SMTP:sasrsw@wnt.sas.com]
> Sent: Thursday, August 05, 1999 4:36 PM
> To: Unicode List
> Subject: Unused code positions and mapping to Unicode
>
> Folks,
>
> What is the proper mapping to Unicode of unused characters in a legacy
> encoding?
>
> For example, in Windows Cp1252-latin1 encoding, given the code position of
> 0x81:
>
> - It appears that notepad will map this to U+0081.
> - Nadine's book shows it mapped to U+FFFE.
> - Java seems to map it to U+FFFD.
> - The mapping tables on the FTP site have it listed as undefined and
> don't give a Unicode value.
>
> I would think that U+FFFD is right. But if you do that, then round-trip
> conversion will not work if there are multiple unused characters in
> the legacy encoding (and for Cp1252 and others there are multiple such
> code positions). Doing what notepad did will solve that, but that seems
> wrong since it is not that character in Unicode. I guess you could use
> the private use area to map them to unique positions, but that does not
> seem right either. And if I did use the private use area, then other
> applications would likely not handle it properly when sent to them. And
> then what do I do when the vendor later defines that code position,
> particularly when the vendor decides not to give it a new name (as
> happened in some cases when the Euro character was added)? I won't be
> able to tell if this is use of an unused code position or now use of
> that new character.
>
> What are others doing to handle this? An answer that the data should
> never contain those code positions, while an understandable argument,
> is not helpful.
>
> Thanks in advance.
>
> Randy
>
> ------------------------------------------------------------------------------
> Randolph S. Williams
> National Language Support          Voice: 919.677.8000
> SAS Institute Inc.                 Fax:   919.677.4444
> Cary, NC 27513 USA                 Email: Randy.Williams@sas.com


