RE: Unused code positions and mapping to Unicode

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Aug 06 1999 - 18:01:21 EDT


The principled position (P):

From the definition of the code page, all undefined characters should be
mapped to U+FFFD. Doing otherwise would presume a definition for these byte
values in terms of the code page itself (i.e. the fact that Unicode defines
U+0080 as a C1 control is irrelevant in the context of an 8-bit code page
that does not support C1 controls, in which 0x80 is therefore undefined).

This is also the only position that allows the recipient to flag the illegal
use of undefined positions in the code page.
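
As a concrete illustration of position (P), here is a small Python sketch.
The table excerpt and the function name are mine, purely for illustration,
and not taken from any particular converter; the point is simply that every
byte the code page leaves undefined comes out as U+FFFD, and the recipient
can record where that happened.

    REPLACEMENT = "\uFFFD"

    # Hypothetical excerpt of a CP 1252-style table: defined bytes map to
    # their Unicode characters, bytes the vendor has not assigned are absent.
    CP1252_EXCERPT = {
        0x41: "\u0041",   # 'A'
        0x80: "\u20AC",   # EURO SIGN (added to CP 1252 in the late 90s)
        0x82: "\u201A",   # SINGLE LOW-9 QUOTATION MARK
        # 0x81, 0x8D, 0x8F, 0x90, 0x9D left out: undefined in CP 1252
    }

    def decode_position_p(data, table):
        """Decode bytes; undefined positions become U+FFFD and are reported."""
        out = []
        flagged = []                      # offsets of undefined/illegal bytes
        for i, b in enumerate(data):
            ch = table.get(b)
            if ch is None:
                out.append(REPLACEMENT)
                flagged.append(i)
            else:
                out.append(ch)
        return "".join(out), flagged

    text, bad = decode_position_p(b"\x41\x81\x80", CP1252_EXCERPT)
    # text == "A\ufffd\u20ac", bad == [1]: the recipient can flag offset 1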

The forward-looking position (I):

Character sets that are not full will get fuller! Therefore, tools that
transparently process 8-bit character sets risk destroying data if unknown
character codes are folded into a single 'unknown character' code such as
U+FFFD.

Round-tripping is important less for the currently defined character set
than for future extensions. A tool written any time during the early 90s
and following position (P) would today be filtering the euro character out
of CP 1252. That's bad.

Therefore, position (I) would pick a set of distinct code points from the
private use area to map the holes in 8-bit sets. These are the only code
points that implementations are free to use for this purpose.
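
A minimal sketch of position (I), again purely illustrative: each hole 0xXX
gets its own private use stand-in, so the original byte survives a round
trip. The choice of U+E000 as the base of the block is an arbitrary
assumption of mine; any agreed-upon set of distinct private use code points
would do.

    PUA_BASE = 0xE000                     # arbitrary illustrative base

    def encode_hole(byte_value):
        """Map an undefined byte to its own private use stand-in."""
        return chr(PUA_BASE + byte_value)

    def decode_hole(ch):
        """Recover the original byte from a private use stand-in, if it is one."""
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            return cp - PUA_BASE
        return None

    # Round trip for the (currently) undefined CP 1252 byte 0x81:
    assert decode_hole(encode_hole(0x81)) == 0x81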

The forward-looking position (II):

This position bends the rules by mapping 'holes' 0xXX to the Unicode
characters U+00XX. There are two benefits and two problems. Benefit 1 is
that other tools can cooperate with this tool predictably, even after the
code page has been extended and before the tool itself has been upgraded.
(A consumer receiving U+0080 in the context of operations on legacy data
can re-map the character internally, if 0x80 has since been assigned a new
meaning in the corresponding legacy character set.)
Benefit 2: Where legacy implementations have used the holes as private-use
control characters, position (II) would retain their control-character
nature.

Problem 1: Benefits 1 and 2 contradict each other.

Problem 2 is that position (II) leads to dual code points for a character
that is later added to the code page at byte 0xYY. There will be instances
where tools produce Unicode data (permanently, not transparently) of the
form U+00YY, and both U+00YY and the character's proper Unicode code point
will then have to be recognized as the new character, for as long as that
new character is popular enough. You can see this happening with the euro.
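
A sketch of position (II), and of Problem 2, with illustrative tables rather
than a full CP 1252 mapping: holes pass straight through as U+00XX, and once
the vendor assigns 0x80 to the euro, legacy data carries U+0080 while
freshly converted data carries U+20AC.

    def decode_position_ii(data, table):
        """Defined bytes use the table; holes pass through as U+00XX."""
        return "".join(table.get(b, chr(b)) for b in data)

    OLD_CP1252 = {0x41: "A"}                    # before the euro: 0x80 is a hole
    NEW_CP1252 = {0x41: "A", 0x80: "\u20AC"}    # after the euro was added

    legacy  = decode_position_ii(b"\x41\x80", OLD_CP1252)   # "A\u0080"
    current = decode_position_ii(b"\x41\x80", NEW_CP1252)   # "A\u20ac"
    # Downstream code now has to accept both U+0080 and U+20AC as the euro.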

At 01:57 PM 8/6/99 -0700, Randy Williams wrote:
>
>Murray,
>
> So are undefined characters outside of the 0x80-0x9F range mapped to U+FFFD?
>For example, in Cp857-Turkish the code positions of 0xD5, 0xE7, and 0xF2.
>
>Randy
>
>-----Original Message-----
>From: Murray Sargent [mailto:murrays@microsoft.com]
>Sent: Friday, August 06, 1999 3:30 PM
>To: Unicode List
>Cc: unicode@unicode.org
>Subject: RE: Unused code positions and mapping to Unicode
>
>
>I would like to underline Ken's remark below: "There is a case to be made, in
>particular for values in the range 0x80..0x9F in an 8-bit encoding, to just
>map them through to Unicode U+0080..U+009F". The point is that Unicode
>_does_ define these positions as the C1 controls. As such, they should not
>be mapped to U+FFFD, although undefined character codes should be so mapped.
>In the absence of other definitions, the 0x80 - 0x9F codes are best mapped
>to themselves.
>
>Murray
>
>> -----Original Message-----
>> From: kenw@sybase.com [SMTP:kenw@sybase.com]
>> Sent: Thursday, August 05, 1999 5:09 PM
>> To: Unicode List
>> Cc: unicode@unicode.org; kenw@sybase.com
>> Subject: Re: Unused code positions and mapping to Unicode
>>
>> Randy asked,
>>
>> >
>> > What is the proper mapping to Unicode of unused characters in a legacy encoding?
>>
>> The unfortunate answer is that there is no single right answer. It depends
>> on what you are doing.
>>
>> >
>> > For example, in Windows Cp1252-latin1 encoding, given the code position of 0x81:
>> >
>> > - It appears that notepad will map this to U+0081.
>> > - Nadine's book shows it mapped to U+FFFE.
>>
>> This one, at least, can be ruled out. U+FFFE is just wrong, and should be
>> U+FFFD in these tables from Nadine Kano's book.
>>
>> > - Java seems to map it to U+FFFD.
>> > - The mapping tables on the FTP site have it listed as undefined and
>> > don't give a Unicode value.
>>
>> Which is probably the right way to define the table. Then an
>> implementation
>> can choose which way it is going to treat these.
>>
>> >
>> > I would think that U+FFFD is right.
>>
>> In the general case, yes.
>>
>> > But if you do that then round-trip
>> > conversion will not work if there are multiple unused characters in
>> > the legacy encoding (and for Cp1252 and others there are multiple such
>> > code positions).
>>
>> But roundtripping to nonexistent code positions in a character encoding
>> is not necessarily a desirable goal anyway.
>>
>> > Doing what notepad did will solve that, but that seems
>> > wrong since it is not that character in Unicode.
>>
>> There is a case to be made, in particular for values in the range
>> 0x80..0x9F
>> in an 8-bit encoding, to just map them through to Unicode U+0080..U+009F,
>> assuming them to be otherwise unspecified control characters. For
>> character
>> encodings that obey the C0/C1 restrictions on graphical characters, such
>> as the 8859 series, this is most likely to be the right answer. However,
>> for Windows code pages and IBM code pages, which stick graphic characters
>> in the range 0x80..0x9F, mapping straight through to Unicode controls is
>> as likely to be wrong -- and will certainly be wrong in the future if the
>> code page in question is extended by adding some specific graphic
>> character
>> at the formerly undefined position.
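
To make Ken's distinction concrete, here is a small sketch; the obeys_c1
flag and the table excerpts are illustrative assumptions of mine, not taken
from any actual mapping table. Code pages that honour the C1 area can pass
0x80..0x9F through, while those that place graphics in that range are better
served by U+FFFD for their undefined bytes.

    def decode_80_9f(b, table, obeys_c1):
        """Decode one byte in 0x80..0x9F according to the code page's C1 policy."""
        if b in table:                    # the code page defines a graphic here
            return table[b]
        if obeys_c1:                      # 8859-style: pass through as a C1 control
            return chr(b)
        return "\uFFFD"                   # Windows/IBM-style: undefined graphic slot

    # ISO 8859-1 keeps 0x80..0x9F for controls; CP 1252 puts graphics there.
    assert decode_80_9f(0x81, {}, obeys_c1=True) == "\u0081"
    assert decode_80_9f(0x81, {0x80: "\u20AC"}, obeys_c1=False) == "\uFFFD"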
>>
>> > I guess you could use the
>> > private use area to map them to unique positions, but that does not seem
>> > right either. And if I did use the private use area then other
>> > applications would likely not handle it properly when sent to them.
>>
>> You would do this kind of thing if you needed internal round-tripping,
>> but it would be inadvisable to interchange data converted this way openly --
>> it provides even less information than if you had substituted U+FFFD for
>> the unconvertible positions.
>>
>> > And then what do I do
>> > when the vendor later defines that code position, particularly when the
>> > vendor decides not to give it a new name (as happened in some cases when
>> > the Euro character was added)? I won't be able to tell if this is a use
>> > of an unused code position or now a use of that new character.
>>
>> When the vendor later defines a formerly undefined code position, there
>> really
>> is no feasible alternative to updating your table(s). Once someone starts
>> using the newly defined code point, you must map it correctly.
>>
>> >
>> > What are others doing to handle this?
>>
>> Generically, I map to U+FFFD. And when vendors update their definitions, I
>> update my tables.
>>
>> --Ken
>>
>> > An answer that the data should never
>> > contain those code positions, while an understandable argument, is not
>> > helpful.
>> >
>> > Thanks in advance.
>> >
>> > Randy
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > Randolph S. Williams
>> > National Language Support       Voice: 919.677.8000
>> > SAS Institute Inc.              Fax:   919.677.4444
>> > Cary, NC 27513 USA              Email: Randy.Williams@sas.com
>> >
>


