L2/01-301

Analysis of Character Deprecation in the Unicode Standard

Ken Whistler
August 1, 2001

Mark Davis has suggested that a character property of "deprecated" be added to the Unicode Character Database, to track those characters that have been deprecated in the standard.

The problem I see is that to date there are many different kinds of deprecation and "discouragement" of various characters, so that it isn't exactly clear what we mean by deprecation and which exact list of characters should be included in it.

The *definition* of deprecation currently given in the standard is:

  D7a: Deprecated character: a coded character whose use is strongly discouraged. Such characters are retained in the standard, but should not be used. (Chapter 3, page 41)

This needs to be compared with the definition and notes for "compatibility character", as well:

  D21 Compatibility character: a character that has a compatibility decomposition.

  * ... They support transmission and processing of legacy data. Their use is discouraged other than for legacy data.

===================================================================

Here is the complete list of characters that have, so far, been labeled, indicated, or implicated as "deprecated" or "discouraged" in the standard. Rick McGowan originally compiled this list, and I have rearranged and annotated it.

A. Labelled as "deprecated"

  1. Vietnamese combining tone marks

     0340 COMBINING GRAVE TONE MARK (Vietnamese)
     0341 COMBINING ACUTE TONE MARK (Vietnamese)

     These were belatedly recognized as mistaken, duplicate encodings, and were formally deprecated by the UTC.

  2. Alternate format controls inherited from 10646

     206A INHIBIT SYMMETRIC SWAPPING
     206B ACTIVATE SYMMETRIC SWAPPING
     206C INHIBIT ARABIC FORM SHAPING
     206D ACTIVATE ARABIC FORM SHAPING
     206E NATIONAL DIGIT SHAPES
     206F NOMINAL DIGIT SHAPES

     These were recognized as "really bad" and were formally deprecated by the UTC when they first went into the Unicode Standard.

B. Labelled as "strongly discouraged"

  1. 3-part Tibetan vowel signs with a-chung's

     0F77 TIBETAN VOWEL SIGN VOCALIC RR
     0F79 TIBETAN VOWEL SIGN VOCALIC LL

     These multi-part vowels are not needed, and have canonical decompositions involving another multi-part vowel 0F81 which itself is "discouraged".

C. Labelled as "discouraged"

  1. 2-part Tibetan vowel signs with a-chung's

     0F73 TIBETAN VOWEL SIGN II
     0F75 TIBETAN VOWEL SIGN UU
     0F81 TIBETAN VOWEL SIGN REVERSED II

     These 2-part Tibetan vowels are not needed. Their canonical decompositions are to sequences of combining marks.

  2. 2-part Greek accent

     0344 COMBINING GREEK DIALYTIKA TONOS

     Its canonical decomposition is to a sequence of combining marks.

D. Indicated as "strongly discouraged", but reserved for use with special protocols.

  1. Tag Characters

     E0001 LANGUAGE TAG
     ...
     E007F CANCEL TAG

     These were born "strongly discouraged" by the UTC, but were not marked as deprecated, since they were put in explicitly for particular protocol usage.

E. Indicated as "strongly discouraged" for plain text interchange

  1. Interlinear Annotation Characters

     FFF9 INTERLINEAR ANNOTATION ANCHOR
     FFFA INTERLINEAR ANNOTATION SEPARATOR
     FFFB INTERLINEAR ANNOTATION TERMINATOR

     See p. 326 of TUS 3.0. "Usage of the annotation character in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver..." This is another way of saying that they are reserved for use with a higher-level protocol.

Then we have groups of characters that are not overtly labelled as deprecated or discouraged, but for which there are implied discouragements by reason of their belonging to disparaged classes of characters.

F. Indicated as "strongly discouraged" "in general"

  1. Letterlike symbols "that are merely font variants or alternative representations of other character sequences." (see TUS 3.0, p.
298) This presumably was intended to apply to all the letterlike symbols in the range 2100..213A that have a "<font>" or "<compat>" compatibility equivalence. But the exact list is unclear. It probably should not include the Hebrew letterlike math symbols, 2135..2138, which also have a *directional* difference. And it probably *should* include the two instances (212A KELVIN SIGN and 212B ANGSTROM SIGN) that have canonical equivalences.

     Some of the letterlike math symbols in the letterlike symbols block also have to be un-discouraged, to match the text for the Plane 1 mathematical alphanumeric symbols, whose repertoire they complete. The Plane 1 mathematical alphanumeric symbols are "intended for use only in mathematical or technical notation; they are not intended for use in non-technical text." This does not constitute a generic discouragement of use, but rather constrains their use to particular kinds of text.

G. Implicated as "discouraged" for any use but legacy data.

  1. All "compatibility" characters.

     The problem here is the ambiguity between the two senses of compatibility characters. Not all compatibility characters in the sense of characters encoded for legacy compatibility with preexisting standards or usage have compatibility decompositions. Presumably it is the broader sense of compatibility characters that is intended for discouragement here. But we don't have a specified or specifiable list of all compatibility characters in the broader sense.

===================================================================

I think that *deprecation* should be a formal action taken by the UTC degrading the status of a character from "approved for general use" to "disapproved for general use". It should require a permanent, formal statement, included as part of the standard (via a UAX, for example), of the reasons for the deprecation.
If such a discipline is followed, then it will be meaningful to have a formal character property which indicates the status of an encoded character as deprecated, since such status will be well-defined.

And to have teeth, deprecation ought to have some conformance implications as well. We cannot actually remove deprecated characters from the standard, but we ought to have a way for conforming processes to indicate that they do not support deprecated characters. Furthermore, it should be a given that other standards referencing the Unicode Standard would, by default, not make use of deprecated characters, either.

"Discouragement of use", on the other hand, should be distinguished from deprecation. It is not a formal status decreed by the UTC, but instead constitutes an implementation guideline, and should be taken as informative only, and subject to editorial updates as needed.

Because of this, I think the definition of "deprecation" currently in the standard should be tightened up and turned into something that reflects a specific UTC decision. As it stands now, it is not possible to determine which characters are actually deprecated by the definition and which are not.

If the UTC decides that a particular strongly discouraged character or group of characters can cause problems that are severe enough to warrant a formal recommendation of their non-use, then it can vote to deprecate them and add them to the formal list of deprecated characters.

But I do not think that we should have a "discouraged" character property, precisely because we are so fuzzy in its application, ranging from some particular "strongly discouraged" characters that probably ought to be formally deprecated, to general discouragement of the use of all compatibility characters.
Also, since discouragement of use is partly in the eye of the beholder, depending on what kinds of implementations one is doing, we risk resurrecting the civil wars between the Cleanicode advocates and the Unicode-for-legacy-support advocates in the committee if we have to pin this stuff down more formally.
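
As a rough illustration of what a well-defined "deprecated" property and a matching conformance check might look like in an implementation, here is a minimal Python sketch. It assumes that the formal list consists only of the group A characters above (0340, 0341, and 206A..206F). The names DEPRECATED, is_deprecated, find_deprecated, and accepts are hypothetical and are not taken from the standard or from the Unicode Character Database.

```python
# Hypothetical sketch: a "deprecated" property modelled as a fixed set of
# code points. Assumption: only the group A characters listed in this memo
# (0340, 0341, 206A..206F) are formally deprecated.
DEPRECATED = frozenset([0x0340, 0x0341]) | frozenset(range(0x206A, 0x2070))


def is_deprecated(ch):
    """Return True if the single character ch is in the assumed deprecated set."""
    return ord(ch) in DEPRECATED


def find_deprecated(text):
    """Return (index, 'U+XXXX') pairs for each deprecated character in text."""
    return [(i, "U+%04X" % ord(c)) for i, c in enumerate(text) if is_deprecated(c)]


def accepts(text, supports_deprecated=False):
    """Model a process that declares whether it supports deprecated characters.

    A process that does not support them rejects any text containing one.
    """
    return supports_deprecated or not find_deprecated(text)
```

Under these assumptions, accepts("a\u0341") is false for a process that declares no support for deprecated characters, while a character that is merely "discouraged", such as 0344, passes untouched. That mirrors the distinction argued for above: only the formally voted list has conformance consequences.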