Re: need examples of one-to-one code point correspondence exceptions

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Dec 01 2008 - 19:48:25 CST


Suhel Jaber asked:

> can anybody give
> me examples in which one code point in another standard corresponds to
> a sequence of code points in Unicode and vice versa, with the
> explanation of the reason why the one-to-one code point correspondence
> policy exception has been made by the Consortium regarding those
> examples?

Sure. JIS X 0213 contains a number of characters that
map to a sequence of two Unicode characters. See, for
example, 0x82F5 (in the Shift-JIS encoding of JIS X 0213):

http://x0213.org/codetable/sjis-0213-2004-std.txt

That maps to the Unicode sequence: <U+304B, U+309A>
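
For anyone who wants to check that mapping locally, here is a minimal
sketch. It assumes Python's built-in "shift_jis_2004" codec is an acceptable
stand-in for the table linked above; the codec name and the variable names
are my own choices for illustration, not anything defined by JIS X 0213 or
the Unicode Standard.

```python
import unicodedata

# Byte pair cited above, in the Shift-JIS encoding of JIS X 0213.
sjis_bytes = bytes([0x82, 0xF5])

# Assumption: Python's "shift_jis_2004" codec follows the same mapping
# as the x0213.org table referenced in this message.
decoded = sjis_bytes.decode("shift_jis_2004")

# One legacy code point comes back as a sequence of Unicode code points.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
names = [unicodedata.name(ch) for ch in decoded]

for cp, name in zip(code_points, names):
    print(cp, name)
```

If the codec agrees with the table, the loop prints one line for the base
kana and one for the combining semi-voiced sound mark.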

In this instance, as in a few others, the standard in question
came *after* normalization was defined for the Unicode Standard,
at which point including new precomposed characters that
could already be expressed as a sequence of base character
plus combining mark became a mostly useless exercise.
JIS X 0213 contains, for example, some new combinations
of kana plus voicing mark that could already be expressed
as a Unicode sequence. So rather than encode new characters in
Unicode simply to enable a one-to-one mapping, the mapping
is made to the existing sequence instead.
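
The effect of that decision is visible through normalization. The sketch
below uses Python's standard unicodedata module (my choice of tool, not
something prescribed here) to compare a kana plus voicing mark that does have
a precomposed character with the kana plus semi-voicing mark from the
example above, which does not. The constant names are illustrative only.

```python
import unicodedata

KA = "\u304B"           # HIRAGANA LETTER KA
VOICED = "\u3099"       # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
SEMI_VOICED = "\u309A"  # COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

# KA + voiced mark has a precomposed form (U+304C), so NFC composes it.
ga = unicodedata.normalize("NFC", KA + VOICED)

# KA + semi-voiced mark has no precomposed form, so NFC leaves the
# two-code-point sequence as it is.
ka_semi = unicodedata.normalize("NFC", KA + SEMI_VOICED)

print([f"U+{ord(ch):04X}" for ch in ga])
print([f"U+{ord(ch):04X}" for ch in ka_semi])
```

The second result stays a sequence under NFC, which is why a converter has
to map the single JIS X 0213 code point to two Unicode code points.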

The guarantees for one-to-one mapping against existing standards
only applied to standards covered by the earlier versions
of Unicode. And the reasons for that earlier policy had
more to do with the limitations of interoperability in
a transition phase *to* Unicode systems, rather than with
the situation now, where more and more systems simply
*are* Unicode systems and where the interoperability concerns
are more isolated to conversion transducers which are
more capable of handling non-one-to-one conversions.

Note that for *some* standards -- even early ones predating
Unicode -- there was always a need for complex non-one-to-one
conversions. ISCII was such an example -- it required
some context analysis to convert correctly. And
conversion of any of the old PC Arabic code pages to "correct"
Unicode -- as opposed to simply mapping one-to-one to
compatibility code points -- also always required contextual
analysis and non-one-to-one conversions.
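
As a rough illustration of the gap between compatibility code points and
"correct" Unicode, the sketch below folds two Arabic presentation forms to
their nominal letters with NFKC. This is only an assumption-laden stand-in:
it is not a converter for any PC Arabic code page, and it does none of the
contextual analysis such a converter needs. The two sample characters and
the variable names are mine, chosen for illustration.

```python
import unicodedata

BEH_INITIAL = "\uFE91"  # ARABIC LETTER BEH INITIAL FORM (compatibility)
LAM_ALEF = "\uFEFB"     # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# A positional form folds to a single nominal letter (one-to-one).
beh = unicodedata.normalize("NFKC", BEH_INITIAL)

# A ligature folds to two nominal letters (one-to-many).
lam_alef = unicodedata.normalize("NFKC", LAM_ALEF)

print([f"U+{ord(ch):04X}" for ch in beh])
print([f"U+{ord(ch):04X}" for ch in lam_alef])
```

The ligature case shows why a one-to-one table cannot produce nominal
Arabic text on its own.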

--Ken


