Re: compatibility characters (in XML context)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Nov 14 2003 - 18:08:43 EST


    Stefan Persson asked:

    > Alexandre Arcouteil wrote:
    > > Is that a clear indication that \u212B is actually a compatibility
    > > character and then should be, according to the XML 1.1 recommendation,
    > > replaced by the \u00C5 character?
    >
    > Isn't U+00C5 a compatibility character for U+0041 U+030A,
    > so that both should be replaced by that?

    O.k., everybody, turn to p. 24 of The Unicode Standard, Version 4.0,
    Figure 2-8 Codespace and Encoded Characters. It is time to go
    to Unicode School(tm).

    There are 3 *abstract characters*:

       an uppercase A of the Latin script
       
       an uppercase Å of the Latin script
       
       a diacritic ring placed above letters in the Latin script
       
    These are potentially encodable units of textual information,
    derived from the orthographic universe associated with Latin
    script usage. They can be "found" in the world as abstractions
    on the basis of graphological analysis, and they exist, from the
    point of view of character encoding committees, a priori.
    They are concepts of character identity, and they don't have
    numbers associated with them.

    Next, character encoding committees get involved, because they
    want numbers associated with abstract characters, so
    that computers can process them as text.

    The Unicode architects noticed (they weren't the first) a
    generality in the Latin script regarding the productive placement
    of diacritics to create new letters. They determined that a
    sufficient encoding for these 3 abstract characters would be:

    U+0041 LATIN CAPITAL LETTER A
    U+030A COMBINING RING ABOVE

    with the abstract character {an uppercase Å of the Latin script}
    representable as a sequence of encoded characters, i.e. as
    <U+0041, U+030A>.
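
    (If you want to see that concretely, here is a small Python sketch --
    nothing beyond the standard library, and purely illustrative -- showing
    two code points carrying one abstract character:)

       # Two encoded characters, one abstract character
       a_with_ring = "\u0041\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
       print(a_with_ring)             # displays as a single Å, given a capable font
       print(len(a_with_ring))        # 2 -- two code points in the sequence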

    But, oh ho!, they also noticed the preexistence of important
    character encoding standards created by other character encoding
    committees that represented the first two of these abstract
    characters as:

    0x41 LATIN CAPITAL LETTER A
    0xC5 LATIN CAPITAL LETTER A WITH RING ABOVE

    and which declined to encode the third abstract character, i.e. the
    diacritic ring itself.

    Enter Unicode Design Principles #9 Equivalent Sequences and
    #10 Convertibility. To get off the ground at all, the
    Unicode Standard simply *had* to have 1-to-1 convertibility
    with ISO 8859-1, as well as a large number of other standards.
    As a result, the UTC added the following encoded character:

    U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

    and decreed that U+00C5 was *canonically equivalent* to the
    sequence <U+0041, U+030A>, thus asserting no difference in
    the interpretation of U+00C5 and of <U+0041, U+030A>.
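
    (Both the convertibility and the canonical equivalence are easy to
    check with a Python sketch using the standard unicodedata module;
    this is just an illustration, not part of any specification:)

       import unicodedata

       # Round-trip convertibility with ISO/IEC 8859-1:
       assert "\u00C5".encode("latin-1") == b"\xC5"
       assert b"\xC5".decode("latin-1") == "\u00C5"

       # Canonical equivalence: both forms reduce to the same NFD sequence
       assert unicodedata.normalize("NFD", "\u00C5") == "\u0041\u030A"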

    Now how does this relate to *compatibility* characters?
    Well, yes, in a sense, U+00C5 is a compatibility character.
    It was encoded for compatibility with ISO/IEC 8859-1 (and
    Code Page 850, and a large number of other preexisting
    encoding standards and code pages). It is generally
    recognized as a "good" compatibility character, since it is
    highly useful in practice and in a sense fits within the
    general Unicode model for how things should be done. (This
    differs, for example, from the "bad" compatibility characters
    like U+FDC1 ARABIC LIGATURE FEH WITH MEEM WITH YEH FINAL FORM.)

    However, U+00C5 is not a compatibility decomposable character
    (or "compatibility composite" -- see definitions on p. 23 of
    TUS 4.0). It is, instead, a *canonical* decomposable character.
    (See pp. 71-72 of TUS 4.0.)
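
    (You can see the difference directly in the decomposition mapping, as
    exposed by Python's unicodedata module -- again just a sketch:)

       import unicodedata

       # U+00C5 has a *canonical* decomposition mapping: no "<compat>"-style
       # tag precedes the mapping in UnicodeData.txt.
       print(unicodedata.decomposition("\u00C5"))   # '0041 030A'

       # For contrast, an arbitrary compatibility decomposable character,
       # U+FB00 LATIN SMALL LIGATURE FF, carries a tagged mapping:
       print(unicodedata.decomposition("\uFB00"))   # '<compat> 0066 0066'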

    Well, what about the Ångstrom sign, you may ask, since I
    haven't mentioned it yet? The Ångstrom sign is simply
    a use of the abstract character {an uppercase Å of the Latin script},
    much like "g" is a gram sign and "s" is a seconds sign, and
    "m" is a meter sign (as well as being a sign for the prefix
    milli-).

    However, there were character encoding standards committees,
    predating the UTC, which did not understand this principle,
    and which encoded a character for the Ångstrom sign as a
    separate symbol. In most cases this would not be a problem,
    but in at least one East Asian encoding, an Ångstrom sign
    was encoded separately from {an uppercase Å of the Latin script},
    resulting in two encodings for what really is the same thing,
    from a character encoding perspective.

    Once again, the Unicode principles of Equivalent Sequences
    and Convertibility came into play. The UTC encoded

    U+212B ANGSTROM SIGN

    and decreed that U+212B was *canonically equivalent* to the
    sequence <U+0041, U+030A>, thus asserting no difference in
    the interpretation of U+212B (and incidentally, also,
    U+00C5) and of <U+0041, U+030A>.

    Unlike U+00C5, however, U+212B is a "bad" compatibility
    character -- one that the UTC would have wished away if it
    could have. The sign of that badness is that its
    decomposition mapping in the UnicodeData.txt file is a
    *singleton* mapping, i.e. a mapping of a single code point
    to another single code point, instead of to a sequence,
    i.e. U+212B --> U+00C5. Such singleton mappings are
    effectively an admission of duplication of character
    encoding. They are present *only* because of a roundtrip
    convertibility issue.
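
    (The singleton is likewise visible through the standard unicodedata
    module -- a quick Python sketch, only for illustration:)

       import unicodedata

       # Singleton mapping: one code point mapping to one other code point
       print(unicodedata.decomposition("\u212B"))   # '00C5'

       # Not a singleton: a code point mapping to a sequence
       print(unicodedata.decomposition("\u00C5"))   # '0041 030A'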

    To sum up so far:

    U+00C5
       is a "good" compatibility character
       is a canonical decomposable character
       is *not* a compatibility decomposable character
       is canonically equivalent to <U+0041, U+030A>
       does not have a singleton decomposition mapping
       
    U+212B
       is a "bad" compatibility character
       is a canonical decomposable character
       is *not* a compatibility decomposable character
       is canonically equivalent to <U+0041, U+030A>
       does have a singleton decomposition mapping

    Now back to the second clause of Stefan's question:

    > Isn't U+00C5 a compatibility character for U+0041 U+030A,
    > so that both should be replaced by that?

    What gets replaced by what depends on the specification
    of normalization. (See UAX #15.)

    For NFD:

       U+00C5 and U+212B are replaced by <U+0041, U+030A>.
       
       <U+0041, U+030A> stays unchanged.
       
    For NFC:

       U+212B and <U+0041, U+030A> are replaced by U+00C5.
       
       U+00C5 stays unchanged.
       
    Normalization is basically completely agnostic about what
    is a "compatibility character", and whether precomposed
    forms should be used or not. One form (NFC) normalizes
    towards precomposed forms; one form (NFD) normalizes
    away from precomposed form, essentially.
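
    (Here is a Python sketch of those normalization results, using the
    standard unicodedata module -- purely illustrative:)

       import unicodedata

       forms = ["\u00C5", "\u212B", "\u0041\u030A"]
       for s in forms:
           nfd = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]
           nfc = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]
           print(nfd, nfc)
       # Every line prints ['U+0041', 'U+030A'] for NFD and ['U+00C5'] for NFC.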

    Note that there are also piles of "compatibility characters" in
    Unicode which have no decomposition mapping whatsoever,
    and which are thus completely unaffected by normalization.
    Some examples:

    U+2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

    (for compatibility with GBK)

    U+FE73 ARABIC TAIL FRAGMENT

    (for compatibility with some old IBM Arabic code pages)

    The whole block of box drawing characters, U+2500..U+257F

    (for compatibility with numerous old code pages)

    and so on.
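
    (One last Python sketch, again only to illustrate: characters with no
    decomposition mapping pass through normalization untouched.)

       import unicodedata

       for ch in ("\u2FF0", "\uFE73", "\u2500"):
           assert unicodedata.decomposition(ch) == ""   # no mapping at all
           assert unicodedata.normalize("NFD", ch) == ch
           assert unicodedata.normalize("NFC", ch) == ch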

    --Ken


