Comments on the proposed deprecation of characters
(public review item #122)

General comment

Deprecating a character results implicitly in deprecating any existing data containing the character. I realize that there is nothing formal in the standard to that effect, but I foresee this as a practical consequence of implementers not allocating scarce resources to deprecated characters.

Where data containing such characters is "broken" (i.e. where its usability is compromised by the problematic nature of the deprecated character) deprecating the character makes sense. Where there's merely a question of an available less ambiguous encoding or the existence of a preferred spelling, deprecation does not make sense and should not be considered.

In other words, I think retaining a distinction between formal deprecation and other forms of negative usage recommendation is useful and important. In particular, it's important to not start down the road of deprecating all compatibility characters!

What would be most useful, in addition to formal deprecation, would be a handy (human readable!) reference that lists all characters that need "special attention" by implementers, beyond what can be determined from formal character properties, together with pointers to where the recommended treatment can be found. Such a list, even as a UTN, might be more valuable, relevant and applicable than an ever-expanding list of 'deprecated' characters.

On PRI #122

Table 1: Discouraged characters

I strongly support the deprecation of 2329 and 232A. While these are still present in many mapping tables, their canonical decomposition to the CJK brackets 3008 and 3009 will be harmful (destroying the formatting of documents). For that reason, the characters are effectively 'broken': their use represents a "trap" for the unwary, changing their documents when normalization is applied. Deprecating them can serve as a warning of unintended consequences.
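To illustrate the trap, here is a minimal sketch, assuming Python's standard unicodedata module. The sample string and the helper name normalize_all are made up for illustration. Because the decomposition is canonical, every normalization form replaces the two characters, not just the compatibility forms.

```python
import unicodedata

# Hypothetical sample: a letter wrapped in U+2329 / U+232A.
ANGLE_BRACKETS = "\u2329x\u232A"

def normalize_all(text):
    """Return the text under each of the four normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}

results = normalize_all(ANGLE_BRACKETS)
for form, value in results.items():
    # Show the code points that survive each normalization form.
    print(form, [f"U+{ord(ch):04X}" for ch in value])
```

In each form the brackets come back as 3008 and 3009, so a document that merely passes through a normalizing process is silently changed.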

As a counterexample, the use of 20A4 Lira sign is largely devoid of such issues. POUND SIGN (00A3) is the preferred encoding, so there's no need to use 20A4, but deprecating it would be over the top. This should be documented simply with a * notice giving the preferred encoding.

The three characters for units, 2126, 212A and 212B, are typical compatibility characters. In that context they are often also implemented as wide characters, a distinction that has not been handled via deprecation for other characters. Therefore, I would not recommend that UTC deprecate these, but rather that UTC continue documenting the use of their preferred encodings (Omega, K, and A with Ring).
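For reference, the mapping from the three unit characters to their preferred encodings can be checked with a short sketch, again assuming Python's standard unicodedata module. The table name UNIT_SYMBOLS and the helper to_preferred are illustrative only.

```python
import unicodedata

# Unit character -> preferred encoding, as named in the comment above.
UNIT_SYMBOLS = {
    "\u2126": "\u03A9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212A": "K",       # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212B": "\u00C5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
}

def to_preferred(text):
    """Map text to its preferred encoding via NFC normalization."""
    return unicodedata.normalize("NFC", text)

for symbol, preferred in UNIT_SYMBOLS.items():
    status = "ok" if to_preferred(symbol) == preferred else "MISMATCH"
    print(f"U+{ord(symbol):04X} -> U+{ord(preferred):04X} {status}")
```

An implementation that wants the preferred encodings can obtain them by normalizing, which is one reason documentation seems sufficient here.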

The three Greek characters are graphically indistinguishable from their preferred counterparts. Deprecating them might be mildly useful; however, I would not want to make a recommendation.

I have no opinion on the Tibetan and Khmer characters.

Table 2: Other proposed

I support the deprecation of 0149 because of the issue of the broken NFKC. In an NFKC context, use of this character will have unintended consequences, making deprecation warranted over and above a mere recommended encoding alternative.
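The difference between the two contexts can be seen in a small sketch, assuming Python's standard unicodedata module (the variable names are illustrative). NFC leaves 0149 alone, while NFKC replaces it with two characters, 02BC followed by 006E.

```python
import unicodedata

N_APOSTROPHE = "\u0149"  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

# Canonical normalization keeps the character as is.
nfc = unicodedata.normalize("NFC", N_APOSTROPHE)

# Compatibility normalization expands it to MODIFIER LETTER APOSTROPHE + n.
nfkc = unicodedata.normalize("NFKC", N_APOSTROPHE)

print("NFC: ", [f"U+{ord(ch):04X}" for ch in nfc])
print("NFKC:", [f"U+{ord(ch):04X}" for ch in nfkc])
```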

I support the deprecation of 0953-0954 by analogy with 0340 and 0341. Again, there's no distinction in visual representation.

I have no opinion on the Tibetan character.


submitted by Asmus Freytag, 2008/07/31