Re: compatibility characters (in XML context)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Nov 14 2003 - 20:35:32 EST


    John Cowan said:

    > Kenneth Whistler scripsit:
    >
    > > However, there were character encoding standards committees,
    > > predating the UTC, which did not understand this principle,
    > > and which encoded a character for the Ångstrom sign as a
    > > separate symbol. In most cases this would not be a problem,
    > > but in at least one East Asian encoding, an Ångstrom sign
    > > was encoded separately from {an uppercase Å of the Latin script},
    > > resulting in two encodings for what really is the same thing,
    > > from a character encoding perspective.
    >
    > But IIRC they did so in two separate character encoding standards
    > which the UTC for reasons of its own decided to treat as one standard.

    Yeah, could be.

    The issue can be seen in JIS X 0208, which has an Ångstrom symbol
    (Row 2, Cell 82), but no accented Latin, and then JIS X 0212,
    which has a bunch of accented Latin, including Å (Row 10, Cell 9).
    They are separate standards, but JIS X 0212 was designed as
    a discontiguous extension of JIS X 0208. You aren't supposed
    to unify its characters against the JIS X 0208 characters.
    The fact that JIS X 0212 basically failed, and has been replaced
    by a rather different JIS X 0213 extension, wasn't something that
    could be foreseen in detail back in 1989 when these initial
    repertoires were being collected.
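
The duplication discussed above survives in Unicode as U+212B ANGSTROM SIGN alongside U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. As a minimal sketch, assuming Python's standard `unicodedata` module (the helper name `same_after_nfc` is made up for illustration), normalization folds the two code points together:

```python
import unicodedata

ANGSTROM_SIGN = "\u212B"  # ANGSTROM SIGN
A_WITH_RING = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two code points differ as raw strings...
print(ANGSTROM_SIGN == A_WITH_RING)
# ...but the Angstrom sign carries a canonical (singleton) decomposition
# to U+00C5, so the two compare equal once normalized.
print(unicodedata.decomposition(ANGSTROM_SIGN))
print(same_after_nfc(ANGSTROM_SIGN, A_WITH_RING))
```

Because the mapping is canonical, any of the four normalization forms removes the distinction; NFC is used here only as an example.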

    >
    > > Note that there are also piles of "compatibility characters" in
    > > Unicode which have no decomposition mapping whatsoever,
    > > and which thus are completely unimpacted by normalization.
    >
    > If someone undertook to prepare a draft list of these, would the
    > UTC consider blessing it, in corrected form? It is disconcerting
    > that the notion "compatibility character" is so fuzzily defined.

    Actually, part of the point of my discussion of compatibility
    characters is to indicate that "compatibility character" per se
    *is* a very fuzzy and contingent concept. It is basically a
    matter more of character encoding history than something that
    should be normatively defined so that implementations and
    other specifications depend on it in some crucial way. Even longtime
    experts on the UTC will have disagreements regarding just which
    characters are "compatibility characters" and which not. My
    statement that Å (and by implication most other precomposed
    Latin characters) is, in a way, a compatibility character would
    itself be somewhat contentious. It depends in part on what
    your vision is regarding how Unicode *should* be, as opposed to
    just what it currently is defined to be.

    What matters for implementations and related specifications
    are the normatively defined statuses of certain characters as
    having decomposition mappings that designate them either
    as compatibility decomposable characters or as canonical
    decomposable characters. That status *is* clearly and unambiguously
    defined for every Unicode character.
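
To make the distinction concrete, here is a rough sketch, again assuming Python's `unicodedata` module, that reads a character's decomposition mapping and reports its status. The function name `decomposition_status` and the three labels it returns are illustrative choices, not terms from the standard. A mapping that begins with a tag such as `<compat>` or `<noBreak>` marks a compatibility decomposition; an untagged mapping is canonical.

```python
import unicodedata

def decomposition_status(ch: str) -> str:
    """Hypothetical helper: classify one character by its decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"           # no mapping listed in the character database
    if mapping.startswith("<"):
        return "compatibility"  # tagged mapping, e.g. "<compat> 0066 0069"
    return "canonical"          # untagged mapping, e.g. "0041 030A"

for ch in ("\u00C5", "\u212B", "\uFB01", "A"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_status(ch)}")
```

One caveat: `unicodedata.decomposition` reports only the mappings listed in the character database, so Hangul syllables, whose canonical decomposition is algorithmic, come back with an empty mapping here.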

    Rather than trying to figure out what all the compatibility
    characters are, I think a much more interesting list would be
    the list of all the *useful* characters in Unicode.

    In other words, while the IRG and other committees are busy
    haggling over what the Basic CJK Subset should be, which would
    be useful for small implementations of Han, maybe the rest
    of you could come up with what the Basic Non-CJK Subset of
    Unicode should be, omitting all the accumulated dreck of
    duplications, mistakes, misguided experiments, and modelling
    errors inherited from older encodings (or stuffed into Unicode
    by the UTC or WG2).

    --Ken


