Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Feb 23 2001 - 14:48:17 EST


Mark said:

> In somewhat more detail:
>
> In general, a single abstract character corresponds to a single code point.
> However, due to the requirement of compatibility with legacy code sets, plus
> some inherent fuzziness in what constitutes abstract characters, there are
> cases where this is not true:

And I'll try to help with the visualization, by providing prototypical
instances of each of these cases:

>
> - one abstract character can correspond to two different code points

{a with ring above} ==> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
                    ==> U+212B ANGSTROM SIGN (singleton canonical equivalence
                                               to U+00C5)

This is only the most notorious example. There are hundreds of such
examples to be found among the CJK Compatibility characters.
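
For anyone who wants to poke at this case, here is a minimal sketch using
Python's standard unicodedata module. The variable names are just
illustrative, and the results depend on the Unicode data shipped with the
interpreter, though this particular equivalence is stable.

```python
import unicodedata

a_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"   # ANGSTROM SIGN

# Two different code points, so a raw comparison says they differ.
raw_equal = (a_ring == angstrom)

# U+212B carries a singleton canonical decomposition to U+00C5.
angstrom_decomposition = unicodedata.decomposition(angstrom)

# After canonical normalization (NFC here) both collapse to U+00C5.
nfc_equal = (unicodedata.normalize("NFC", a_ring)
             == unicodedata.normalize("NFC", angstrom))

print(raw_equal, angstrom_decomposition, nfc_equal)
```

The point of the sketch is only that the two code points compare equal once
normalized, which is what canonical equivalence amounts to in practice.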

> - one abstract character can correspond to a sequence of two code points

{a with ring above} ==> <U+0041, U+030A>

The obvious instances are the precomposed characters and, in particular,
the combining character sequences canonically equivalent to them.
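
A small sketch of the same thing, again assuming Python's standard
unicodedata module (the names "sequence" and "precomposed" are mine, not
anything from the standard):

```python
import unicodedata

sequence = "\u0041\u030A"   # <U+0041, U+030A>: A followed by COMBINING RING ABOVE
precomposed = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE

# NFC composes the two-code-point sequence into the single precomposed code point.
composed = unicodedata.normalize("NFC", sequence)

# NFD goes the other way and yields the two-code-point sequence.
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(sequence), len(composed), composed == precomposed)
print(len(precomposed), len(decomposed), decomposed == sequence)
```

Either form stands for the same abstract character; only the number of
code points differs.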

> - one code point can correspond to two different abstract characters

{Latin baseline ellipsis}
                          ==> U+2026 HORIZONTAL ELLIPSIS
{CJK centerline ellipsis}

{Greek capital alpha}
                          ==> U+0391 GREEK CAPITAL LETTER ALPHA
{Coptic capital alpha}

These are instances of unifications for the encoding. Some we deal with
and get on with our lives. Others provoke arguments for disunification,
as for the Coptic example.

> - one code point can correspond to a sequence of two abstract characters

{f} + {i} ==> U+FB01 LATIN SMALL LIGATURE FI
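
A rough illustration, assuming Python's standard unicodedata module. Note
that the ligature's mapping to <f, i> is a compatibility decomposition, not
a canonical one, so it shows up under NFKC/NFKD but not under NFC/NFD.

```python
import unicodedata

ligature = "\uFB01"   # LATIN SMALL LIGATURE FI

ligature_name = unicodedata.name(ligature)

# Canonical normalization leaves the single code point alone.
nfc_form = unicodedata.normalize("NFC", ligature)

# Compatibility normalization expands it to the two letters f and i.
nfkc_form = unicodedata.normalize("NFKC", ligature)

print(ligature_name, len(nfc_form), nfkc_form, len(nfkc_form))
```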

--Ken


