Re: compatibility characters (in XML context)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Nov 14 2003 - 15:12:08 EST

    Alexandre,

    > Philippe Verdy wrote:
    >
    > > From: "Kent Karlsson" <kentk@cs.chalmers.se>
    > >
    > >>Philippe Verdy wrote:
    > >>
    > >>> (1) a singleton (example the Angström symbol, canonically
    > >>>mapped to A with diaeresis,
    > >>
    > >>The Ångström (note spelling) sign is canonically mapped to
    > >>capital a with ring.
    >
    > Thanks for all explanations,

    Please disregard Philippe's misleading blatherings on this
    topic.

    The place to start is to read Unicode Technical Report #20,
    Unicode in XML and other Markup Languages (despite Philippe's
    disclaimers about it).

    See, in particular, Section 5 of that report, "Characters
    with Compatibility Mappings", which provides a series
    of recommendations for things to do and not to do for
    compatibility characters in an XML context.

    >
    > Keeping the A with ring example, does it mean that compatibility
    > characters can be identified according to the Unicode charts?

    See section 2.3 Compatibility Characters in the Unicode Standard:

    http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

    In general, compatibility characters cannot be identified
    simply by looking at the Unicode code charts. The subset
    of compatibility characters known as compatibility composite
    characters *can* be identified by their decompositions listed
    in the names list sections of the Unicode code chart. Or you
    can parse them mechanically out of the UnicodeData.txt file
    in the Unicode Character Database online.
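    As a rough illustration of the mechanical approach, here is a
    minimal Python sketch. It assumes the semicolon-separated
    UnicodeData.txt layout, where field 5 holds the decomposition
    and a leading tag such as <compat> marks a compatibility
    decomposition. The three sample records and the function name
    are illustrative assumptions, not part of the database tooling;
    in practice you would read the real file.

```python
# Sample records in the UnicodeData.txt field layout (assumed here for
# illustration; in practice, iterate over the lines of the real file).
SAMPLE = """\
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
"""


def compatibility_decompositions(lines):
    """Yield (code point, tag, mapping) for records whose decomposition is tagged."""
    for line in lines:
        fields = line.strip().split(";")
        # Field 5 is the decomposition; a leading "<tag>" means compatibility.
        if len(fields) < 6 or not fields[5].startswith("<"):
            continue
        tag, *mapping = fields[5].split()
        yield int(fields[0], 16), tag, [int(cp, 16) for cp in mapping]


result = list(compatibility_decompositions(SAMPLE.splitlines()))
for code_point, tag, mapping in result:
    print(f"U+{code_point:04X} {tag} " + " ".join(f"{cp:04X}" for cp in mapping))
```

    Only the ligature record is reported. The two records with
    untagged (canonical) decompositions, including U+212B, are
    skipped.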

    U+212B ANGSTROM SIGN *is* a compatibility character in the
    first sense defined in Section 2.3 of the standard. It is
    not, however, a compatibility composite character.
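    One way to see the distinction for yourself, sketched with
    Python's standard unicodedata module (the choice of U+FB01 as
    the contrasting character and the helper name are my own
    assumptions for illustration):

```python
import unicodedata


def has_tagged_decomposition(ch):
    """True if the character's decomposition carries a compatibility tag."""
    return unicodedata.decomposition(ch).startswith("<")


# U+212B ANGSTROM SIGN: a singleton canonical decomposition, no tag.
angstrom = unicodedata.decomposition("\u212B")

# U+FB01 LATIN SMALL LIGATURE FI: a tagged compatibility decomposition.
ligature = unicodedata.decomposition("\uFB01")

print(repr(angstrom), has_tagged_decomposition("\u212B"))
print(repr(ligature), has_tagged_decomposition("\uFB01"))
```

    The decomposition data alone therefore does not flag U+212B as
    a compatibility character, even though it is one in the first
    sense of Section 2.3.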

    > For example, in the case of \u212B ANGSTROM SIGN, it is documented:
    > "preferred representation is 00C5 Å latin capital letter a with ring".
    >
    > Is that a clear indication that \u212B is actually a compatibility
    > character

    No, it is not. Such comments occur regarding other characters
    which may or may not be compatibility characters.

    > and then should be, according to the XML 1.1 recommendation,
    > replaced by the \u00C5 character?

    The reason has to do with normalization. U+212B *is* a
    compatibility character. It is *not* a compatibility
    composite character. But the crucial factor is that it
    has a singleton canonical decomposition. If you normalize
    text data using Unicode normalization form NFC, as recommended
    by the W3C, then U+212B will be replaced by U+00C5, as
    a result of the normalization.
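    The effect is easy to check. The following is a small sketch
    using Python's standard unicodedata module; the sample string
    is an arbitrary assumption:

```python
import unicodedata

text = "10 \u212B"  # "10 " followed by U+212B ANGSTROM SIGN

# NFC: the singleton canonical decomposition maps U+212B to U+00C5.
nfc = unicodedata.normalize("NFC", text)

# NFD: full canonical decomposition yields U+0041 followed by U+030A.
nfd = unicodedata.normalize("NFD", text)

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfd])
```

    Note that U+212B survives in neither form. Because the mapping
    is canonical, it applies under NFC and NFD, not only under the
    compatibility forms NFKC and NFKD.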

    This stuff *is* rather confusing for people encountering it the
    first time. But the above sources should help. Also see
    the W3C working draft for the Character Model for the World Wide Web
    1.0:

    http://www.w3.org/TR/charmod/

    --Ken


