Compatibility Characters [CLIP AND SAVE]

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Dec 09 1999 - 21:52:47 EST


Unicadetti,

The term "compatibility character" seems, unfortunately, to
lead people into terminal confusion -- even among the cognoscenti.
Asmus just fell into this trap.

So one more time, I will attempt to clarify the seemingly
unclarifiable.

There are *two* (that's right *TWO*) distinct meanings for
the term "compatibility character" in Unicode. I quote from
the Glossary for the Unicode Standard, Version 3.0:

Compatibility Character.

  (1) A character encoded only for compatibility with preexisting
      encoding standards to support transcoding.

  (2) A character that has a compatibility decomposition. (See
      Definition D21 in Section 3.6, Decomposition.)

Examples of compatibility characters in sense 1:

U+00BC VULGAR FRACTION ONE QUARTER [Latin-1]
U+FF21 FULLWIDTH LATIN CAPITAL LETTER A [Code Page 932]
U+01F9 LATIN SMALL LETTER N WITH GRAVE [GBK]
U+212B ANGSTROM SIGN [KSC 5601]
U+FE20 COMBINING LIGATURE LEFT HALF [ISO 5426-1983]

Examples of compatibility characters in sense 2:

U+00BC VULGAR FRACTION ONE QUARTER [Latin-1]
U+FF21 FULLWIDTH LATIN CAPITAL LETTER A [Code Page 932]
U+FC21 ARABIC LIGATURE SAD WITH MEEM ISOLATED FORM
U+01F1 LATIN CAPITAL LETTER DZ
U+02DB OGONEK

Please note carefully that not all compatibility characters
in sense 1 are also compatibility characters in sense 2:
U+01F9 LATIN SMALL LETTER N WITH GRAVE was encoded for
transcoding compatibility with GBK, but has a *canonical*
decomposition. U+212B ANGSTROM SIGN was encoded for transcoding
compatibility with KSC 5601, but has a *canonical* decomposition.
U+FE20 COMBINING LIGATURE LEFT HALF was encoded for transcoding
compatibility with ISO 5426-1983, but has *no* decomposition.
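
Sense 2 is the only one of the two senses that can be checked mechanically, since it depends solely on the decomposition mapping recorded in the Unicode Character Database. A minimal sketch in Python follows, assuming the standard `unicodedata` module, whose answers reflect whichever Unicode version ships with the interpreter. The helper name `decomposition_kind` is made up for illustration; it relies on compatibility mappings carrying a tag such as `<wide>` or `<compat>`, while canonical mappings carry none. Sense 1 is a matter of encoding history and cannot be derived this way.

```python
import unicodedata


def decomposition_kind(code_point: int) -> str:
    """Return 'compatibility', 'canonical', or 'none' for a code point.

    Hypothetical helper: it only inspects the decomposition mapping that
    Python's unicodedata module exposes for the character.
    """
    mapping = unicodedata.decomposition(chr(code_point))
    if not mapping:
        return "none"
    # Compatibility mappings start with a formatting tag in angle brackets.
    return "compatibility" if mapping.startswith("<") else "canonical"


# The sample characters from the two lists above.
for cp in (0x00BC, 0xFF21, 0x01F9, 0x212B, 0xFE20, 0xFC21, 0x01F1, 0x02DB):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {decomposition_kind(cp)}")
```

Only the characters reported as "compatibility" are compatibility characters in sense 2; a "canonical" or "none" result says nothing about whether the character is one in sense 1.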

Please also note carefully that not all compatibility characters
in sense 2 are also compatibility characters in sense 1:
U+FC21 ARABIC LIGATURE SAD WITH MEEM ISOLATED FORM did not
come from a preexisting standard; it was created *for* 10646 by
the Egyptian committee. U+01F1 LATIN CAPITAL LETTER DZ did
not come from a preexisting standard; it was created *for*
10646 by the Slovenian committee. U+02DB OGONEK did not come
from a preexisting source standard (although by now it may
be crossmapped to a bibliographic standard); it was created *for*
the Unicode Standard by the UTC to round out a set of spacing
clones for non-spacing marks.

Finally, please note carefully that not all compatibility
characters (in either sense) come from the so-called "compatibility
zone" -- a term which itself has been almost completely extinguished
in the current version of the Unicode Standard, because it was causing
further confusion. There are compatibility characters in both
senses all over the encoding space, and there are non-compatibility
characters in the "compatibility zone": Yiddish precomposed characters,
Arabic ornate parentheses, and ZWNBSP (BOM).
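
The point about the "compatibility zone" can be spot-checked for sense 2 in the same mechanical way. The sketch below is illustrative only: it assumes the zone means the range U+F900..U+FFEF (the boundaries are an assumption, not taken from this note), uses Python's `unicodedata` module, and introduces a hypothetical helper, `lacks_compatibility_decomposition`. It collects assigned code points in that range that have no compatibility decomposition.

```python
import unicodedata

# Assumption: the old "compatibility zone" is taken to be U+F900..U+FFEF.
ZONE = range(0xF900, 0xFFF0)


def lacks_compatibility_decomposition(code_point: int) -> bool:
    """True for an assigned character with no compatibility mapping."""
    ch = chr(code_point)
    if unicodedata.category(ch) == "Cn":  # unassigned or noncharacter
        return False
    # Compatibility mappings start with a tag such as <wide> or <isolated>.
    return not unicodedata.decomposition(ch).startswith("<")


exceptions = [cp for cp in ZONE if lacks_compatibility_decomposition(cp)]

# Spot-check the kinds of characters named above.
for cp in (0xFB1D, 0xFD3E, 0xFEFF):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {cp in exceptions}")
```

The resulting list depends on the Unicode version bundled with the interpreter, and it addresses sense 2 only; whether a character is a compatibility character in sense 1 still has to be answered from its encoding history.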

Clip and file this note somewhere you can search for "compatibility
character", and before asking about or making claims about
compatibility characters in Unicode on the list, consult this first.

--Ken


