RE: Unused code positions and mapping to Unicode

From: Randy Williams (sasrsw@wnt.sas.com)
Date: Fri Aug 06 1999 - 17:03:01 EDT


Murray,

  So are undefined characters outside of the 0x80-0x9F range mapped to U+FFFD?
For example, the code positions 0xD5, 0xE7, and 0xF2 in Cp857-Turkish?
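
One way to see which positions a given table leaves undefined is to probe it
byte by byte. The following is a minimal sketch, assuming Python's bundled
single-byte codecs as a stand-in for the vendor tables. The helper name
`undefined_positions` is hypothetical, and Python's tables may not match a
vendor's current definitions.

```python
def undefined_positions(encoding):
    """Return the byte values that the named single-byte codec cannot decode.

    Assumption: `encoding` names a single-byte codec known to Python,
    so every byte can be tested in isolation.
    """
    missing = []
    for b in range(0x100):
        try:
            bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            # The codec's table has no Unicode value for this position.
            missing.append(b)
    return missing
```

Calling `undefined_positions("cp857")` lists the gaps in that table as Python
defines it; whether they coincide with 0xD5, 0xE7, and 0xF2 depends on the
version of the table the codec was built from.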

Randy

-----Original Message-----
From: Murray Sargent [mailto:murrays@microsoft.com]
Sent: Friday, August 06, 1999 3:30 PM
To: Unicode List
Cc: unicode@unicode.org
Subject: RE: Unused code positions and mapping to Unicode

I would like to underline Ken's remark below: "There is a case to be made, in
particular for values in the range 0x80..0x9F in an 8-bit encoding, to just
map them through to Unicode U+0080..U+009F." The point is that Unicode
_does_ define these positions as the C1 controls. As such, they should not
be mapped to U+FFFD, although undefined character codes should be so mapped.
In the absence of other definitions, the 0x80 - 0x9F codes are best mapped
to themselves.
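
The policy described here (undefined positions in 0x80..0x9F map to
themselves, any other undefined position maps to U+FFFD) can be sketched as
follows. This is only an illustration under stated assumptions: it uses
Python's bundled single-byte codecs as the mapping tables, and the function
name `decode_c1_passthrough` is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def decode_c1_passthrough(data, encoding="cp1252"):
    """Decode single-byte data, one byte at a time.

    Defined positions use the codec's own mapping. Undefined positions in
    0x80..0x9F pass through as the C1 controls U+0080..U+009F; all other
    undefined positions become U+FFFD.
    """
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            if 0x80 <= b <= 0x9F:
                out.append(chr(b))       # map to the same-valued C1 control
            else:
                out.append(REPLACEMENT)  # genuinely undefined position
    return "".join(out)
```

With the Cp1252 table, 0x80 still decodes to the Euro sign because that
position is defined, while 0x81 comes out as U+0081.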

Murray

> -----Original Message-----
> From: kenw@sybase.com [SMTP:kenw@sybase.com]
> Sent: Thursday, August 05, 1999 5:09 PM
> To: Unicode List
> Cc: unicode@unicode.org; kenw@sybase.com
> Subject: Re: Unused code positions and mapping to Unicode
>
> Randy asked,
>
> >
> > What is the proper mapping to Unicode of unused characters in a legacy
> > encoding?
>
> The unfortunate answer is that there is no single right answer. It depends
> on what you are doing.
>
> >
> > For example, in Windows Cp1252-latin1 encoding, given the code position
> > of 0x81:
> >
> > - It appears that notepad will map this to U+0081.
> > - Nadine's book shows it mapped to U+FFFE.
>
> This one, at least, can be ruled out. U+FFFE is just wrong, and should be
> U+FFFD in these tables from Nadine Kano's book.
>
> > - Java seems to map it to U+FFFD.
> > - The mapping tables on the FTP site have it listed as undefined and
> > don't give a Unicode value.
>
> Which is probably the right way to define the table. Then an implementation
> can choose which way it is going to treat these.
>
> >
> > I would think that U+FFFD is right.
>
> In the general case, yes.
>
> > But if you do that then round-trip
> > conversion will not work if there are multiple unused characters in
> > the legacy encoding (and for Cp1252 and others there are multiple such
> > code positions).
>
> But roundtripping to nonexistent code positions in a character encoding
> is not necessarily a desirable goal anyway.
>
> > Doing what notepad did will solve that, but that seems
> > wrong since it is not that character in Unicode.
>
> There is a case to be made, in particular for values in the range
> 0x80..0x9F in an 8-bit encoding, to just map them through to Unicode
> U+0080..U+009F, assuming them to be otherwise unspecified control
> characters. For character encodings that obey the C0/C1 restrictions on
> graphical characters, such as the 8859 series, this is most likely to be
> the right answer. However, for Windows code pages and IBM code pages, which
> stick graphic characters in the range 0x80..0x9F, mapping straight through
> to Unicode controls is as likely to be wrong -- and will certainly be wrong
> in the future if the code page in question is extended by adding some
> specific graphic character at the formerly undefined position.
>
> > I guess you could use the
> > private use area to map them to unique positions, but that does not seem
> > right either. And if I did use the private use area then other
> > applications would likely not handle it properly when sent to them.
>
> You would do this kind of thing if you needed internal round-tripping,
> but it would be inadvisable to interchange data converted this way openly
> -- it provides even less information than if you had substituted U+FFFD
> for the unconvertible positions.
>
> > And then what do I do
> > when the vendor later defines that code position, particularly when the
> > vendor decides not to give it a new name (as happened in some cases when
> > the Euro character was added)? I won't be able to tell if this is use of
> > an unused code position or now use of that new character.
>
> When the vendor later defines a formerly undefined code position, there
> really is no feasible alternative to updating your table(s). Once someone
> starts using the newly defined code point, you must map it correctly.
>
> >
> > What are others doing to handle this?
>
> Generically, I map to U+FFFD. And when vendors update their definitions, I
> update my tables.
>
> --Ken
>
> > An answer that the data should never
> > contain those code positions, while an understandable argument, is not
> > helpful.
> >
> > Thanks in advance.
> >
> > Randy
> >
> >
> > ------------------------------------------------------------------------
> > Randolph S. Williams
> > National Language Support        Voice: 919.677.8000
> > SAS Institute Inc.               Fax:   919.677.4444
> > Cary, NC 27513 USA               Email: Randy.Williams@sas.com
> >
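
The private-use-area approach discussed in the quoted message, used purely
for internal round-tripping, can be sketched as follows. This is a minimal
illustration under stated assumptions: it relies on Python's bundled
single-byte codecs, the base value `PUA_BASE = 0xE000` is an arbitrary
choice within the private use area, and the names `decode_pua` and
`encode_pua` are hypothetical.

```python
PUA_BASE = 0xE000  # hypothetical choice: undefined byte b is parked at U+E000 + b

def decode_pua(data, encoding="cp1252"):
    """Decode single-byte data, mapping each undefined byte to a unique PUA code point."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            out.append(chr(PUA_BASE + b))
    return "".join(out)

def encode_pua(text, encoding="cp1252"):
    """Reverse decode_pua: PUA code points in the reserved block become the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)    # restore the undefined byte
        else:
            out += ch.encode(encoding)   # normal table lookup
    return bytes(out)

raw = b"caf\xe9 \x81\x8d"
assert encode_pua(decode_pua(raw)) == raw  # distinct undefined bytes survive
```

By contrast, `raw.decode("cp1252", errors="replace")` collapses both undefined
bytes into U+FFFD, so the original byte values cannot be recovered. Text
produced by `decode_pua` is only meaningful to code that knows the
`PUA_BASE` convention, which is why it should stay internal.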


