Re: Error in definition of "compatibility character"?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Oct 26 2001 - 20:42:45 EDT


David Hopwood wrote:

> > First of all, as Mark pointed out, there are two quite distinct
> > usages of the term in the standard currently.
> >
> > 1. (decomposition) compatibility character
> >
> > That is what D21 is about, and is derived on the basis of
> > the presence or absence of compatibility decompositions.
> >
> > 2. (legacy) compatibility character
> >
> > These are characters that were included in the standard for
> > compatibility with other standards, for crossmapping, or
> > for other legacy interoperability reasons. Sometimes they
> > have compatibility mappings, sometimes they have canonical
> > mappings (see, e.g., all the CJK compatibility ideographs),
> > and sometimes they have no mappings to other Unicode characters.
> >
> > The text of the standard is being rewritten to make the distinction
> > between these two uses of the term clear.
>
> Is there any formal definition of a legacy compatibility character
> in terms of the Unicode data files, or is it only possible to give a
> list? (If the latter, perhaps it would be useful to add a "Legacy"
> property to PropList-n.n.n.txt.)

There is no formal definition.

It might be nice to have a list, but we'd have
to spend a year arguing over its contents. One man's
"compatibility character" is another man's "gotta have it
required character". By some reckoning, all of the precomposed
Latin letters from 8859-1, -2, -3, ... are compatibility
characters, for example.

The problem is that you cannot really divorce the problem of
designating characters as "compatibility characters" (which is
a polite way of denigrating them as unwelcome intruders that
we have to put up with) from the problem of trying to scope
out "Cleanicode" -- what Unicode would have been like if all
the scripts and symbols could have been encoded without having
to take legacy character encoding mistakes, obsolete implementation
practices, encoding committee compromises, and the like into
account. And while numerous people have on occasion threatened
to go off and define Cleanicode, to date no one has, to my
knowledge.

It would take a braver man than I to pull out the marker pen
and divide all 94,140 Unicode characters into the good ones
and the bad ones, and then defend that line against the
thousands of people who would disagree about where the line
was drawn. Frankly, with some exceptions which the UTC has
agreed to call out as particularly egregious, we are probably
all better off just living with the ambiguity -- believing that
the other guys' bad characters are just there for compatibility,
but that my own good characters are full-fledged citizens
with no compatibility brand on their flanks. It's more of
a standards politics issue than an implementation issue.

>
> > In my opinion, rather than just "fixing" the D1 definition
> > of "compatibility character" to match one or the other
> > of these, we need a further clarification of the distinctions,
> > and if necessary new terminology to make it easier to know
> > which of these sets we are talking about.
>
> I'd suggest keeping "compatibility character" for NFKD(c) != NFD(c),
> and call the other definition just "legacy character". After all,
> legacy characters don't have any formal relation to compatibility
> equivalence.

True enough, but then ASCII A..Z are also legacy characters. And
the terminology verges on meaningless.
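
To make the proposed NFKD(c) != NFD(c) test concrete, here is a minimal
sketch using Python's standard unicodedata module. The function name and
the sample characters are illustrative choices, not anything defined by
the standard, and the results reflect whatever version of the Unicode
data the Python build ships with.

```python
import unicodedata


def is_decomposition_compat(ch: str) -> bool:
    """Return True when NFKD(ch) differs from NFD(ch).

    This is the test suggested in the quoted message; the name of this
    helper is hypothetical.
    """
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    return nfkd != nfd


# Illustrative sample characters (chosen for this sketch).
samples = [
    "\uFB01",  # LATIN SMALL LIGATURE FI: has a <compat> decomposition
    "\uF900",  # CJK COMPATIBILITY IDEOGRAPH-F900: canonical mapping only
    "\u00E9",  # LATIN SMALL LETTER E WITH ACUTE: canonical decomposition
    "A",       # LATIN CAPITAL LETTER A: no decomposition at all
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "| decomposition:", repr(unicodedata.decomposition(ch)),
        "| NFKD != NFD:", is_decomposition_compat(ch),
    )
```

Under this test the fi ligature qualifies, while U+F900 does not, even
though it sits in a block named for compatibility: its mapping is
canonical, so NFD and NFKD agree. The precomposed e-acute and ASCII A
fail the test as well, which is why this test alone cannot pick out the
"legacy" set.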

Also, the Unicode Standard now has its own 11-year legacy of
claiming that various characters "are encoded for compatibility
with XYZ", and calling those characters "compatibility characters".
Trying to fix that now by switching to new terminology would probably
introduce as much miscomprehension as it would address. We
(the editors) have reluctantly concluded that owning up to
and clarifying the polysemy of "compatibility character" in
the standard is likely the course of least unintended consequences.

--Ken


