RE: Unused code positions and mapping to Unicode

From: Murray Sargent (murrays@microsoft.com)
Date: Fri Aug 06 1999 - 15:31:00 EDT

Next message: Aleksandar Susnjar: "Decomposable characters with marks (or other combining characterrs)..."
Previous message: Kenneth Whistler: "Re: Unicode 3.0: Update of some data files for the beta"
Maybe in reply to: Randy Williams: "Unused code positions and mapping to Unicode"
Next in thread: John Cowan: "Re: Unused code positions and mapping to Unicode"

I would like to underline Ken's remark below: "There is a case to be made, in
particular for values in the range 0x80..0x9F in an 8-bit encoding, to just
map them through to Unicode U+0080..U+009F." The point is that Unicode
_does_ define these positions as the C1 controls. As such, they should not
be mapped to U+FFFD, although undefined character codes should be so mapped.
In the absence of other definitions, the 0x80..0x9F codes are best mapped
to themselves.
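
As a rough illustration of this policy (a sketch only, not how any
particular product does it), the Python function below decodes a legacy
8-bit code page one byte at a time. Bytes the codec leaves undefined are
passed through as C1 controls when they fall in 0x80..0x9F, and become
U+FFFD otherwise. The name decode_legacy and its c1_passthrough flag are
invented for this example; the only library behavior relied on is that
Python's "cp1252" codec rejects undefined bytes such as 0x81.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_legacy(data: bytes, codec: str = "cp1252",
                  c1_passthrough: bool = True) -> str:
    """Decode an 8-bit code page, choosing a policy for undefined bytes.

    Hypothetical helper: defined bytes use the codec's own table.
    Undefined bytes in 0x80..0x9F map to U+0080..U+009F when
    c1_passthrough is true; every other undefined byte maps to U+FFFD.
    """
    out = []
    for b in data:
        try:
            # Strict decoding raises for bytes the table does not define.
            out.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            if c1_passthrough and 0x80 <= b <= 0x9F:
                out.append(chr(b))       # map the byte to itself (C1 control)
            else:
                out.append(REPLACEMENT)  # generic "unconvertible" marker
    return "".join(out)
```

With this sketch, decode_legacy(b"\x81") yields U+0081, while
decode_legacy(b"\x81", c1_passthrough=False) yields U+FFFD, which is the
generic policy Ken describes below.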

Murray

> -----Original Message-----
> From: kenw@sybase.com [SMTP:kenw@sybase.com]
> Sent: Thursday, August 05, 1999 5:09 PM
> To: Unicode List
> Cc: unicode@unicode.org; kenw@sybase.com
> Subject: Re: Unused code positions and mapping to Unicode
>
> Randy asked,
>
> >
> > What is the proper mapping to Unicode of unused characters in a legacy
> > encoding?
>
> The unfortunate answer is that there is no single right answer. It depends
> on what you are doing.
>
> >
> > For example, in Windows Cp1252-latin1 encoding, given the code position
> > of 0x81:
> >
> > - It appears that notepad will map this to U+0081.
> > - Nadine's book shows it mapped to U+FFFE.
>
> This one, at least, can be ruled out. U+FFFE is just wrong, and should be
> U+FFFD in these tables from Nadine Kano's book.
>
> > - Java seems to map it to U+FFFD.
> > - The mapping tables on the FTP site have it listed as undefined and
> > don't give a Unicode value.
>
> Which is probably the right way to define the table. Then an implementation
> can choose which way it is going to treat these.
>
> >
> > I would think that U+FFFD is right.
>
> In the general case, yes.
>
> > But if you do that then round-trip
> > conversion will not work if there are multiple unused characters in
> > the legacy encoding (and for Cp1252 and others there are multiple such
> > code positions).
>
> But roundtripping to nonexistent code positions in a character encoding
> is not necessarily a desirable goal anyway.
>
> > Doing what notepad did will solve that, but that seems
> > wrong since it is not that character in Unicode.
>
> There is a case to be made, in particular for values in the range 0x80..0x9F
> in an 8-bit encoding, to just map them through to Unicode U+0080..U+009F,
> assuming them to be otherwise unspecified control characters. For character
> encodings that obey the C0/C1 restrictions on graphical characters, such
> as the 8859 series, this is most likely to be the right answer. However,
> for Windows code pages and IBM code pages, which stick graphic characters
> in the range 0x80..0x9F, mapping straight through to Unicode controls is
> as likely to be wrong -- and will certainly be wrong in the future if the
> code page in question is extended by adding some specific graphic character
> at the formerly undefined position.
>
> > I guess you could use the
> > private use area to map them to unique positions, but that does not seem
> > right either. And if I did use the private use area then other applications
> > would likely not handle it properly when sent to them.
>
> You would do this kind of thing if you needed an internal round-tripping,
> but it would be inadvisable to interchange data converted this way openly --
> it provides even less information than if you had substituted U+FFFD for
> the unconvertible positions.
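
The internal round trip Ken describes could be sketched roughly as follows
in Python. Everything named here is an assumption made for illustration:
PUA_BASE is an arbitrary offset inside the private use area
(U+E000..U+F8FF), and decode_internal, encode_internal and for_interchange
are hypothetical helper names, not any vendor's API.

```python
PUA_BASE = 0xF700  # hypothetical: undefined byte b is stored as U+F700 + b


def decode_internal(data: bytes, codec: str = "cp1252") -> str:
    """Decode for internal use; undefined bytes go to unique PUA code points."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            out.append(chr(PUA_BASE + b))  # unique, so the byte is recoverable
    return "".join(out)


def encode_internal(text: str, codec: str = "cp1252") -> bytes:
    """Inverse of decode_internal; restores the original undefined bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode(codec)  # raises if ch is not in the code page
    return bytes(out)


def for_interchange(text: str) -> str:
    """Replace the private markers with U+FFFD before data leaves the program."""
    return "".join(
        "\ufffd" if PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF else ch
        for ch in text
    )
```

The markers mean something only to the program that created them, which is
why the sketch converts them to U+FFFD before interchange.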
>
> > And then what do I do
> > when the vendor later defines that code position, particularly when the
> > vendor decides not to give it a new name (as happened in some cases when
> > the Euro character was added)? I won't be able to tell if this is use of
> > an unused code position or now use of that new character.
>
> When the vendor later defines a formerly undefined code position, there
> really is no feasible alternative to updating your table(s). Once someone
> starts using the newly defined code point, you must map it correctly.
>
> >
> > What are others doing to handle this?
>
> Generically, I map to U+FFFD. And when vendors update their definitions, I
> update my tables.
>
> --Ken
>
> > An answer that the data should never
> > contain those code positions, while an understandable argument, is not
> > helpful.
> >
> > Thanks in advance.
> >
> > Randy
> >
> >
> > ------------------------------------------------------------------------------
> > Randolph S. Williams
> > National Language Support          Voice: 919.677.8000
> > SAS Institute Inc.                 Fax:   919.677.4444
> > Cary, NC 27513 USA                 Email: Randy.Williams@sas.com
> >


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:50 EDT