L2/01-301


Analysis of Character Deprecation in the Unicode Standard

Ken Whistler
August 1, 2001

Mark Davis has suggested that a character property of "deprecated" be
added to the Unicode Character Database, to track those characters
that have been deprecated in the standard.

The problem I see is that, to date, the standard applies many different
kinds of deprecation and "discouragement" to various characters, so it
is not clear what we mean by deprecation or exactly which characters
the list should include.

The *definition* of deprecation currently given in the standard
is:

D7a: Deprecated character: a coded character whose use is strongly
     discouraged. Such characters are retained in the standard, but
     should not be used.

(Chapter 3, page 41)

This needs to be compared with the definition and notes for "compatibility
character", as well:

D21 Compatibility character: a character that has a compatibility
    decomposition.

    * ... They support transmission and processing of legacy data.
      Their use is discouraged other than for legacy data.

===================================================================

Here is the complete list of characters that have, so far, been
labelled, indicated, or implicated as "deprecated" or "discouraged"
in the standard.

Rick McGowan originally compiled this list, and I have rearranged
and annotated it.

A. Labelled as "deprecated"

1. Vietnamese combining tone marks

0340	COMBINING GRAVE TONE MARK (Vietnamese)
0341	COMBINING ACUTE TONE MARK (Vietnamese)

These were belatedly recognized as mistaken, duplicate encodings,
and were formally deprecated by the UTC.

2. Alternate format controls inherited from 10646

206A	INHIBIT SYMMETRIC SWAPPING
206B	ACTIVATE SYMMETRIC SWAPPING
206C	INHIBIT ARABIC FORM SHAPING
206D	ACTIVATE ARABIC FORM SHAPING
206E	NATIONAL DIGIT SHAPES
206F	NOMINAL DIGIT SHAPES

These were recognized as "really bad" and were formally deprecated
by the UTC when they first went into the Unicode Standard.

B. Labelled as "strongly discouraged"

1. 3-part Tibetan vowel signs with a-chung

0F77	TIBETAN VOWEL SIGN VOCALIC RR
0F79	TIBETAN VOWEL SIGN VOCALIC LL

These multi-part vowels are not needed. Their decompositions (tagged
<compat> in the Unicode Character Database) involve another multi-part
vowel, 0F81, which is itself "discouraged".

C. Labelled as "discouraged"

1. 2-part Tibetan vowel signs with a-chung

0F73	TIBETAN VOWEL SIGN II
0F75	TIBETAN VOWEL SIGN UU
0F81	TIBETAN VOWEL SIGN REVERSED II

These 2-part Tibetan vowels are not needed. Each decomposes canonically
to a sequence of combining marks.

2. 2-part Greek accent

0344	COMBINING GREEK DIALYTIKA TONOS

Its canonical decomposition is to a sequence of combining marks.
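
For anyone who wants to check the decompositions cited in sections A
through C, here is a minimal sketch using Python's standard unicodedata
module. It reports whatever version of the Unicode Character Database
the Python build ships, which will be later than the data discussed
here. The names CODE_POINTS and describe are purely illustrative.

```python
import unicodedata

# Code points taken from sections A.1, B.1, C.1, and C.2 above.
CODE_POINTS = [0x0340, 0x0341, 0x0F77, 0x0F79, 0x0F73, 0x0F75, 0x0F81, 0x0344]


def describe(cp):
    """Return (hex code point, character name, raw decomposition field)."""
    ch = chr(cp)
    return (f"{cp:04X}", unicodedata.name(ch), unicodedata.decomposition(ch))


for row in (describe(cp) for cp in CODE_POINTS):
    print(*row, sep="\t")
```

A decomposition field that starts with a <tag> is a compatibility
decomposition; an untagged field is a canonical one.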

D. Indicated as "strongly discouraged", but reserved for use with special
protocols.

1. Tag Characters

E0001	LANGUAGE TAG
...
E007F	CANCEL TAG

These were born "strongly discouraged" by the UTC, but were not marked
as deprecated, since they were put in explicitly for particular protocol
usage.

E. Indicated as "strongly discouraged" for plain text interchange

1. Interlinear Annotation Characters

FFF9	INTERLINEAR ANNOTATION ANCHOR
FFFA	INTERLINEAR ANNOTATION SEPARATOR
FFFB	INTERLINEAR ANNOTATION TERMINATOR

See p. 326 of TUS 3.0. "Usage of the annotation character in plain text
interchange is strongly discouraged without prior agreement between
the sender and the receiver..."  This is another way of saying that
they are reserved for use with a higher-level protocol.

Then we have groups of characters that are not overtly labelled
as deprecated or discouraged, but for which there are implied
discouragements by reason of their belonging to disparaged
classes of characters.

F. Indicated as "strongly discouraged" "in general"

1. Letterlike symbols "that are merely font variants or alternative
representations of other character sequences." (see TUS 3.0, p. 298)

This presumably was intended to apply to all the letterlike symbols
in the range 2100..213A that have a "<font>" or "<compat>" compatibility 
equivalence.

But the exact list is unclear. It probably should not include
the Hebrew letterlike math symbols, 2135..2138, which also have a
*directional* difference. And it probably *should* include the
instances (2126 OHM SIGN, 212A KELVIN SIGN, and 212B ANGSTROM SIGN)
that have canonical equivalences.

Some of the letterlike math symbols in the letterlike symbols
block also have to be un-discouraged, to match the text for
the Plane 1 mathematical alphanumeric symbols, whose repertoire
they complete. The Plane 1 mathematical alphanumeric symbols
are "intended for use only in mathematical or technical
notation; they are not intended for use in non-technical text."
This does not constitute a generic discouragement of use; rather,
it constrains their use to particular kinds of text.
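
One way to make the candidate list in F.1 concrete is to sort the range
2100..213A by the kind of decomposition each character has. The sketch
below does that with Python's standard unicodedata module. The function
name classify_letterlike is hypothetical, and the result reflects the
Unicode Character Database bundled with the Python build rather than
the version of the standard under discussion, so treat it as a starting
point for review and not as the list itself.

```python
import unicodedata
from collections import defaultdict


def classify_letterlike(first=0x2100, last=0x213A):
    """Group code points in [first, last] by decomposition tag.

    Keys are the tag without angle brackets ("font", "compat", ...),
    "canonical" for untagged decompositions, and "none" otherwise.
    """
    groups = defaultdict(list)
    for cp in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp:
            tag = "none"
        elif decomp.startswith("<"):
            tag = decomp[1:decomp.index(">")]
        else:
            tag = "canonical"
        groups[tag].append(cp)
    return dict(groups)


groups = classify_letterlike()
for tag in ("font", "compat", "canonical"):
    print(tag, " ".join(f"{cp:04X}" for cp in groups.get(tag, [])))
```

The Hebrew symbols 2135..2138 land in the "compat" group, so excluding
them would still need an explicit exception. Which of the "font"
symbols to un-discourage for mathematical use is likewise a judgment
that the data alone does not settle.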

G. Implicated as "discouraged" for any use but legacy data.

1. All "compatibility" characters.

The problem here is the ambiguity between the two senses of
"compatibility character". In the narrow sense of D21, it is a
character that has a compatibility decomposition. In the broader
sense, it is any character encoded for legacy compatibility with
preexisting standards or usage, and not all of those have
compatibility decompositions. Presumably it is the broader sense
that is intended for discouragement here. But we don't have
a specified or specifiable list of all compatibility characters
in the broader sense.
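
The narrow sense is at least mechanically testable, which is what makes
the gap visible. The following sketch, using Python's standard
unicodedata module, checks only for a tagged (compatibility)
decomposition. U+F900 is chosen here purely as an illustration and is
not taken from the list above: it sits in the CJK Compatibility
Ideographs block, yet its decomposition is canonical, so a D21-style
test does not flag it. The function name is hypothetical.

```python
import unicodedata


def has_compatibility_decomposition(ch):
    """True if the decomposition field carries a <tag> (the D21 sense)."""
    return unicodedata.decomposition(ch).startswith("<")


# Illustrative samples only; neither is singled out by the standard here.
for ch in ("\u2102", "\uF900"):
    print(
        f"{ord(ch):04X}",
        unicodedata.name(ch),
        repr(unicodedata.decomposition(ch)),
        has_compatibility_decomposition(ch),
        sep="\t",
    )
```

No comparable test exists for the broader sense, since it depends on
the history of why a character was encoded and not on any property
value.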

===================================================================

I think that *deprecation* should be a formal action taken by
the UTC degrading the status of a character from "approved for
general use" to "disapproved for general use".

It should require a permanent, formal statement, included as
part of the standard (via a UAX, for example), of the reasons
for the deprecation.

If such a discipline is followed, then it will be meaningful
to have a formal character property which indicates the status
of an encoded character as deprecated, since such status will
be well-defined.

And to have teeth, deprecation ought to have some conformance
implications as well. We cannot actually remove deprecated
characters from the standard, but we ought to have a way
for conforming processes to indicate that they do not support
deprecated characters. Furthermore, it should be a given that
other standards referencing the Unicode Standard would, by
default, not make use of deprecated characters, either.
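
To make the conformance idea concrete, here is a sketch in Python of
how a process might declare that it does not support deprecated
characters. Everything in it is an assumption for illustration: the
DEPRECATED set holds only the characters listed under A above and
stands in for the proposed property, and the function names
find_deprecated and reject_deprecated are invented.

```python
# Hypothetical stand-in for a "deprecated" property: only the characters
# listed in section A of this document (0340, 0341, 206A..206F).
DEPRECATED = frozenset([0x0340, 0x0341, *range(0x206A, 0x2070)])


def find_deprecated(text):
    """Return (index, code point) pairs for each deprecated character."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in DEPRECATED]


def reject_deprecated(text):
    """Return text unchanged, or raise ValueError if it contains
    any character from the DEPRECATED set."""
    hits = find_deprecated(text)
    if hits:
        listing = ", ".join(f"U+{cp:04X} at index {i}" for i, cp in hits)
        raise ValueError(f"deprecated characters not supported: {listing}")
    return text


print(find_deprecated("a\u0340b\u206f"))
```

Rejecting the input is only one possible policy. A process could
equally well pass such characters through uninterpreted, as long as it
states which behavior it implements.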

"Discouragement of use", on the other hand, should be distinguished
from deprecation. It is not a formal status decreed by the UTC,
but instead constitutes an implementation guideline, and should
be taken as informative only, and subject to editorial updates
as needed.

Because of this, I think the definition of "deprecation" currently
in the standard should be tightened up and turned into something
that reflects a specific UTC decision. As it stands now, it is
not possible to determine which characters are actually
deprecated by the definition and which are not.

If the UTC decides that a particular strongly discouraged
character or group of characters can cause problems that are
severe enough to warrant a formal recommendation of their non-use,
then it can vote to deprecate them and add them to the formal
list of deprecated characters. But I do not think that we
should have a "discouraged" character property, precisely because
we are so fuzzy in its application. It ranges from some particular
"strongly discouraged" characters that probably ought to be
formally deprecated, to general discouragement of the use
of all compatibility characters. Also, since discouragement
of use is partly in the eye of the beholder, depending on what
kinds of implementations one is doing, we risk resurrecting
the civil wars between the Cleanicode advocates and the
Unicode-for-legacy-support advocates in the committee if we have
to pin this stuff down more formally.