Re: Unused code positions and mapping to Unicode

From: Geoffrey Waigh (anzu@home.com)
Date: Fri Aug 06 1999 - 18:43:39 EDT


Asmus Freytag wrote:
>
> The principled position (P):
>
> From the definition of the code page, all undefined characters should be
> mapped to 0xFFFD. Doing otherwise would presume a definition for these byte
> values (in terms of the code page - i.e. the fact that Unicode defines
> 0x0080 as a C1 control is irrelevant in the context of an 8-bit code page
> that does not support C1 controls and therefore 0x80 is undefined).
>
> This is also the only position that allows the recipient to flag the illegal
> use of undefined positions in the code page.
>
> The forward looking position (I):
>
> Character sets that are not full, will get fuller! Therefore, tools that
> transparently process 8-bit character sets risk destroying data if unknown
> character codes are folded into a single 'unknown character' code, e.g. FFFD.
>
> Round tripping is important, less for the currently defined character set,
> but more for the future extensions. A tool, written anytime during the
> early 90's and following position (P) would today be filtering the Euro
> character from CP 1252. That's bad.
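
A minimal sketch of what position (P) does to a byte that is a hole today
but gets assigned later. The two tables are hypothetical, cut down to a few
entries for illustration; the only real-world fact assumed is that CP 1252
eventually assigned 0x80 to the euro sign (U+20AC). The function names are
made up for this example.

```python
REPLACEMENT = "\ufffd"

# Hypothetical early-1990s view of CP 1252: 0x80 is still a hole.
# Only a few entries are listed; a real table has up to 256.
OLD_TABLE = {0x41: "A", 0x85: "\u2026", 0xA3: "\u00a3"}

# Later revision of the same code page: 0x80 is now the euro sign.
NEW_TABLE = {**OLD_TABLE, 0x80: "\u20ac"}


def decode_p(data, table):
    """Position (P): every byte the table does not define becomes U+FFFD."""
    return "".join(table.get(b, REPLACEMENT) for b in data)


def encode_p(text, table):
    """Reverse mapping; anything without a byte value becomes '?'."""
    reverse = {ch: b for b, ch in table.items()}
    return bytes(reverse.get(ch, ord("?")) for ch in text)


data = bytes([0x41, 0x80, 0x81])

# A tool built on the old table folds both unknown bytes into one code,
# so the distinction between 0x80 and 0x81 is gone for good.
old_view = decode_p(data, OLD_TABLE)
assert old_view == "A\ufffd\ufffd"
assert encode_p(old_view, OLD_TABLE) == b"A??"

# A tool built on the new table keeps the euro and still flags 0x81.
assert decode_p(data, NEW_TABLE) == "A\u20ac\ufffd"
```

The upside is visible too: U+FFFD in the output is an unambiguous signal
that the converter met something it did not understand.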

But if you don't know how to correctly map it, putting it somewhere else
is bad too. People can kludge it with remapping out of the incorrect
spot afterwards, though if they have multiple source codepages that
filled in their holes differently, that might not be such an easy task.
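
A rough sketch of why the after-the-fact remapping gets awkward. Everything
here is hypothetical: two invented code pages ("page_a", "page_b") whose
former hole 0x80 was later assigned different characters, and a converter
that had parked each hole 0xXX at U+E0XX in the private use area.

```python
# Hypothetical later assignments for the former hole 0x80 in two code pages.
LATER_ASSIGNMENTS = {
    "page_a": {0x80: "\u20ac"},  # assumed: page A put the euro sign here
    "page_b": {0x80: "\u0402"},  # assumed: page B put a Cyrillic letter here
}

PUA_BASE = 0xE000  # assumed placeholder convention: hole 0xXX -> U+E0XX


def placeholder(byte):
    """The stand-in character the old converter emitted for a hole."""
    return chr(PUA_BASE + byte)


def remap(text, source_page):
    """Fix up placeholders, which only works if the source page is known."""
    fixes = {placeholder(b): ch
             for b, ch in LATER_ASSIGNMENTS[source_page].items()}
    return "".join(fixes.get(ch, ch) for ch in text)


def candidates(ch):
    """Everything a placeholder could mean once provenance is lost."""
    byte = ord(ch) - PUA_BASE
    return {table[byte] for table in LATER_ASSIGNMENTS.values()
            if byte in table}


stored = "10 " + placeholder(0x80)

# With the source page recorded, the kludge works.
assert remap(stored, "page_a") == "10 \u20ac"
assert remap(stored, "page_b") == "10 \u0402"

# Without it, the same stored character has two possible readings.
assert candidates(placeholder(0x80)) == {"\u20ac", "\u0402"}
```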

At least with FFFD, the user knows they have to upgrade their software -
which is another reason that having Unicode properties and libraries
in a system-supplied standard package, rather than reimplemented in each
application, is a Good Thing. Alas, the politics on that one will
dog us for many years.

Geoffrey

>
> Therefore, position I would pick any set of distinct character codes from
> the private use area to map holes in 8-bit sets. These codes are the only
> ones that implementations are free to use for this purpose.
>
> The forward looking position (II):
>
> This position bends the rules, by mapping 'holes' 0xXX to Unicode
> characters U+00XX. There are two benefits and two problems. Benefit 1 is
> that other tools can cooperate with this tool predictably, even after the
> CP was extended, and before the tool was itself upgraded. (A consumer
> receiving 0x0080 in context of operations on legacy data can re-map the
> character internally, if 0x80 has been assigned new meaning since in the
> corresponding legacy character set).
> Benefit 2: Where legacy implementations have used the holes as private use
> control characters, position II would retain the control character nature.
>
> Problem 1? Benefit 1 and 2 contradict each other.
>
> Problem 2 is that position II leads to dual code points for characters YY
> that have been added to the code page. There will be instances where tools
> produce Unicode data (permanently, not transparently) of the form 0x00YY
> and this will lead to both 00YY and YYYY to be recognized as the new
> character, as long as that new character is popular enough. You can see that
> happening with the euro.
>
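
The two forward looking positions can be sketched side by side. This is an
illustration under stated assumptions, not anyone's actual converter: the
tables are hypothetical and cut down to two entries, the private use base
U+E000 is an arbitrary choice, and the function names are invented.

```python
EURO = "\u20ac"

# Hypothetical tables: before and after 0x80 was assigned to the euro.
OLD_TABLE = {0x41: "A"}
NEW_TABLE = {0x41: "A", 0x80: EURO}

PUA_BASE = 0xE000  # assumed: hole 0xXX is parked at U+E0XX


def decode_i(data, table):
    """Position I: each hole gets its own private use code point."""
    return "".join(table[b] if b in table else chr(PUA_BASE + b)
                   for b in data)


def encode_i(text, table):
    """Inverse of decode_i, so unknown bytes survive a round trip."""
    reverse = {ch: b for b, ch in table.items()}
    out = []
    for ch in text:
        if ch in reverse:
            out.append(reverse[ch])
        elif PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF:
            out.append(ord(ch) - PUA_BASE)
        else:
            raise ValueError("no byte value for %r" % ch)
    return bytes(out)


def decode_ii(data, table):
    """Position II: hole 0xXX is passed through as U+00XX."""
    return "".join(table.get(b, chr(b)) for b in data)


data = bytes([0x41, 0x80, 0x81])

# Position I: distinct holes stay distinct and round-trip exactly.
assert decode_i(data, OLD_TABLE) == "A\ue080\ue081"
assert encode_i(decode_i(data, OLD_TABLE), OLD_TABLE) == data

# Position II: an old tool and a new tool store the same source byte
# 0x80 as two different Unicode characters (problem 2 above).
assert decode_ii(data, OLD_TABLE) == "A\x80\x81"
assert decode_ii(data, NEW_TABLE) == "A" + EURO + "\x81"
```

Under position II, any consumer that may see data from both generations of
tool ends up having to treat U+0080 and U+20AC as the same character.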


