L2/03-286

Title: Han variant issues
Contributors: Richard Cook, Tom Bishop
Date: August 24, 2003

This document is a progress report summarizing current thinking on Han variant issues, as that thinking has developed in the course of our research and in discussions with people from Adobe, IRG, and UTC. Although the opinions presented here are our own, we believe that they represent some degree of consensus among the discussants. Thanks especially to Ken Lunde, Rick McGowan, Eric Muller, and Ken Whistler for their valuable input.

The set of Unicode 4.0 Unified CJKV Ideographs now encodes a great many characters and variants of different types. No clear system has yet been developed to manage this enormous mass of data, and as a result quality control, long-term maintenance, future extension of the repertory, implementation, and usage are all problematic. Because fundamental issues remain to be addressed, we would like to suggest that the UTC draft a resolution calling for an immediate moratorium on Han encoding. This moratorium would require that encoding of Han script entities be suspended (to the extent that this is possible) until it is agreed that the following key issues have been resolved:

(1) Typology of Han characters and variants;
(2) Usage of Variation Selectors (VS) for Han;
(3) Elements of a Chinese Character Description Language (CDL).

These three issues are not so distinct as their separate enumeration might suggest. On the contrary, they are closely interrelated, in that all three are required for precise definition and careful encoding of individual CJKV script entities. In the near future we anticipate providing the UTC with a set of white papers presenting detailed recommendations in the above areas, including examples to substantiate the call for a moratorium.
For present purposes, we will look at each of the above three items and briefly describe its significance to the task of keeping the Unicode UniHan repertory accurate and useful in the long run.

Under item (1) we wish to describe which kinds of Chinese-derived script entities exist, how they relate to each other, and which are suitable for encoding (by whatever mechanism). This is important because one cannot relate forms of a particular type (in a variant database, or via the Variation Selector mechanism) unless the types themselves are well defined.

Under item (2) we believe that it is necessary to achieve consensus on the question "When, if ever, is usage of Variation Selectors appropriate for Han?" As we have seen in Mr. Jenkins' recent UTC contribution regarding VS, there is some feeling that the VS mechanism might be suitable for future encoding of regular PRC simplified forms. Are there other cases in which VS usage might be considered? If so, what are they, and what principles can be followed to organize and manage this process?

Under item (3) we wish to define a language suitable for precise description of any Han script entity. The system which we will present is an XML version of Wenlin Institute's Chinese Character Description Language (CDL). This extensible system employs a set of basic stroke types, (x,y) coordinates, and transformations. A CDL description encodes an analysis of the character into its constituent components and strokes, and simultaneously provides instructions for displaying the character. A collection of CDL descriptions can therefore serve as both a database and a font.

CDL is based on three characteristics of CJKV characters:

(1) Most characters are composed by combining two or more simpler characters or components and fitting them into a square.
(2) Basic characters or components are composed of strokes, and a set of a few dozen basic stroke types is sufficient for the construction of practically all characters in a modern printed style.
(3) The number of strokes in a character, the order in which the strokes are written, and the classification of those strokes are important for many kinds of text processing, such as indexing and comparing variant forms.

We envision the elements of the CDL as the foundation of a system for accomplishing the tasks of building, publishing, maintaining, and using the set of CJKV characters and variants. For this reason, it may be appropriate to consider encoding a "CJK Unified Ideographic Stroke Type" block with specific properties, based on modern orthographic standards.
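
As an illustration only, the three characteristics on which CDL is based (components fitted into a square, a small set of stroke types, and ordered, countable strokes) might be modeled as in the following minimal Python sketch. This is not the XML CDL itself, whose elements are not specified in this report. The class names, the stroke-type labels ("h", "s", "p", "n"), the coordinates, and the convention of a unit square with the origin at the top left are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Stroke:
    kind: str            # hypothetical stroke-type label, e.g. "h" for a horizontal stroke
    points: List[Point]  # control points inside the unit square (assumed: origin top left)


@dataclass
class Placement:
    glyph: "Glyph"       # a simpler character or component being reused
    box: Box             # where it is fitted inside the parent's unit square


@dataclass
class Glyph:
    name: str
    # Strokes and placed components, kept in writing order.
    elements: List[Union[Stroke, Placement]] = field(default_factory=list)


def flatten(glyph: Glyph, box: Box = (0.0, 0.0, 1.0, 1.0)) -> List[Stroke]:
    """Expand a glyph into its strokes, in order, scaled into `box`."""
    x0, y0, x1, y1 = box

    def to_abs(p: Point) -> Point:
        # Map a point from the glyph's unit square into the target box.
        return (x0 + p[0] * (x1 - x0), y0 + p[1] * (y1 - y0))

    strokes: List[Stroke] = []
    for el in glyph.elements:
        if isinstance(el, Stroke):
            strokes.append(Stroke(el.kind, [to_abs(p) for p in el.points]))
        else:
            # Nested component: compose its box with the current one and recurse.
            ax0, ay0 = to_abs((el.box[0], el.box[1]))
            ax1, ay1 = to_abs((el.box[2], el.box[3]))
            strokes.extend(flatten(el.glyph, (ax0, ay0, ax1, ay1)))
    return strokes


def stroke_count(glyph: Glyph) -> int:
    return len(flatten(glyph))


# Illustrative data: a rough four-stroke 木, then 林 built from two copies of it.
mu = Glyph("木", [
    Stroke("h", [(0.1, 0.35), (0.9, 0.35)]),
    Stroke("s", [(0.5, 0.05), (0.5, 0.95)]),
    Stroke("p", [(0.5, 0.35), (0.1, 0.9)]),
    Stroke("n", [(0.5, 0.35), (0.9, 0.9)]),
])
lin = Glyph("林", [
    Placement(mu, (0.0, 0.0, 0.5, 1.0)),
    Placement(mu, (0.5, 0.0, 1.0, 1.0)),
])

if __name__ == "__main__":
    print(stroke_count(lin), [s.kind for s in flatten(lin)])
```

In this sketch a single collection of descriptions serves both purposes mentioned above: the nested structure records the analysis into components and strokes, while `flatten` yields the positioned strokes a renderer would need. Stroke counts and stroke-type sequences fall out of the same data.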