L2/03-286

Title: Han variant issues
Contributors: Richard Cook, Tom Bishop
Date: August 24, 2003

This document is intended as a progress report, summarizing current
thinking on Han variant issues, as this thinking has developed in the
course of our research and in discussions with people from Adobe, IRG,
and UTC. Although the opinions presented here are our own, we believe
that they represent some degree of consensus among the discussants.
Thanks especially to Ken Lunde, Rick McGowan, Eric Muller, and Ken
Whistler, for their valuable input.

The set of Unicode 4.0 Unified CJKV Ideographs now encodes a great many
characters and variants of different types. No clear system has yet been
developed to manage this mass of data; as a result, quality control,
long-term maintenance, future extension of the repertory, implementation,
and usage are all problematic.

Because fundamental issues remain to be addressed, we suggest that the
UTC draft a resolution calling for an immediate moratorium on Han
encoding. Under this moratorium, encoding of Han script entities would be
suspended (to the extent that this is possible) until it is agreed that
the following key issues have been resolved:

    (1) Typology of Han characters and variants;

    (2) Usage of Variation Selectors (VS) for Han;

    (3) Elements of a Chinese Character Description Language (CDL).

These three issues are not so distinct as their separate enumeration
might suggest. On the contrary, they are closely interrelated in that they
are all required for precise definition and careful encoding of individual
CJKV script entities.

In the near future we anticipate providing the UTC with a set of white
papers presenting detailed recommendations in the above areas, including
examples to substantiate the call for a moratorium. For present purposes,
we will look at each of the above three items, and briefly describe its
significance to the task of keeping the Unicode Unihan repertory accurate
and useful in the long run.

Under item (1) we wish to describe which kinds of Chinese-derived script
entities exist, how they relate to each other, and which are suitable
for encoding (by whatever mechanism). This is important because one
cannot relate forms of a particular type (in a variant database, or via
the Variation Selector mechanism) unless the types themselves are well
defined.

Under item (2) we believe that it is necessary to achieve consensus on
the question "When, if ever, is the use of Variation Selectors
appropriate for Han?" As we have seen in Mr. Jenkins' recent UTC
contribution regarding VS, there is a feeling that the VS mechanism might
be suitable for future encoding of regular PRC simplified forms. Are
there other cases in which VS usage might be considered? If so, what are
they, and what principles can be followed to organize and manage this
process?
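
For readers less familiar with the mechanism itself: a variation
sequence is a base character followed immediately by a variation
selector, and the standardized selectors VS1 through VS16 occupy
U+FE00..U+FE0F. The Python sketch below only illustrates that mechanical
shape. The helper names are our own, and the pairing of U+8AAA with VS1
is a hypothetical example, not a claim that any such sequence is defined
or proposed.

```python
import unicodedata

# VS1 is the first standardized variation selector (U+FE00).
VS1 = "\uFE00"


def is_variation_selector(ch: str) -> bool:
    """True if ch is named VARIATION SELECTOR-n in the character database."""
    return unicodedata.name(ch, "").startswith("VARIATION SELECTOR")


def make_variation_sequence(base: str, selector: str) -> str:
    """Return the base character followed by a variation selector."""
    if len(base) != 1 or len(selector) != 1:
        raise ValueError("expected exactly one base and one selector")
    if not is_variation_selector(selector):
        raise ValueError("second argument is not a variation selector")
    return base + selector


def strip_variation_selectors(text: str) -> str:
    """Drop all selectors, recovering the plain base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# Hypothetical pairing, for illustration only: U+8AAA followed by VS1.
seq = make_variation_sequence("\u8AAA", VS1)
```

A process that ignores selectors sees only the base character, which is
why it matters which kinds of variant are entrusted to this mechanism.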

Under item (3) we wish to define a language suitable for precise
description of any Han script entity. The system which we will present
is an XML version of Wenlin Institute's Chinese Character Description
Language (CDL). This extensible system employs a set of basic stroke
types, (x,y) coordinates, and transformations.

A CDL description encodes an analysis of the character into its
constituent components and strokes, and simultaneously provides
instructions for displaying the character. A collection of CDL
descriptions can therefore serve as both a database and a font.
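
To make the "database and font" idea concrete, here is a minimal Python
sketch of such a description. It is not the CDL format itself: the class
names, the stroke label "h", the unit-square coordinate convention (y
increasing downward), and the decomposition of the example character are
all assumptions made for illustration. Each description lists strokes
and placed components; flattening it yields strokes in absolute
coordinates, which is what a renderer would need.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Stroke:
    kind: str            # hypothetical stroke-type label, e.g. "h"
    points: List[Point]  # control points inside the parent's unit square


@dataclass
class Component:
    glyph: "Glyph"  # the simpler character being reused
    box: Box        # where it is placed inside the parent's unit square


@dataclass
class Glyph:
    name: str
    parts: List[Union[Stroke, Component]] = field(default_factory=list)


def flatten(glyph: Glyph, box: Box = (0.0, 0.0, 1.0, 1.0)) -> List[Stroke]:
    """Resolve nested components into strokes with absolute coordinates."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    out: List[Stroke] = []
    for part in glyph.parts:
        if isinstance(part, Stroke):
            pts = [(x0 + px * w, y0 + py * h) for px, py in part.points]
            out.append(Stroke(part.kind, pts))
        else:
            bx0, by0, bx1, by1 = part.box
            child = (x0 + bx0 * w, y0 + by0 * h, x0 + bx1 * w, y0 + by1 * h)
            out.extend(flatten(part.glyph, child))
    return out


# Illustrative data: one horizontal stroke, reused twice at two sizes.
one = Glyph("一", [Stroke("h", [(0.0, 0.5), (1.0, 0.5)])])
two = Glyph("二", [
    Component(one, (0.25, 0.0, 0.75, 0.5)),  # shorter upper stroke
    Component(one, (0.0, 0.5, 1.0, 1.0)),    # longer lower stroke
])
strokes = flatten(two)
```

The same structure answers analytic questions (which components a
character contains) and display questions (where each stroke goes),
without storing the information twice.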

CDL is based on three characteristics of CJKV characters:

    (1) Most characters are composed by combining two
        or more simpler characters or components and
        fitting them into a square.

    (2) Basic characters or components are composed of
        strokes, and a set of a few dozen basic stroke
        types is sufficient for the construction of
        practically all characters in a modern printed style.

    (3) The counting of strokes in a character, the
        order in which the strokes are written, and
        classifications of those strokes, are important
        for many kinds of text processing, such as
        indexing, and comparing variant forms.
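
As a rough illustration of characteristic (3), the Python sketch below
derives a stroke sequence, a stroke count, and a simple index key from
component-based descriptions. The mini-database, the stroke labels "h"
and "s", and the function names are hypothetical, chosen only to keep
the example short.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-database. Each entry lists, in writing order, either
# a stroke-type label ("h" horizontal, "s" vertical) or a component
# character that has its own entry.
DESCRIPTIONS: Dict[str, List[str]] = {
    "一": ["h"],
    "丨": ["s"],
    "二": ["一", "一"],
    "十": ["一", "丨"],
    "土": ["十", "一"],
    "士": ["十", "一"],
}


def stroke_sequence(char: str) -> List[str]:
    """Expand components recursively into an ordered list of stroke types."""
    seq: List[str] = []
    for part in DESCRIPTIONS[char]:
        if part in DESCRIPTIONS:
            seq.extend(stroke_sequence(part))
        else:
            seq.append(part)
    return seq


def stroke_count(char: str) -> int:
    """Number of strokes after all components are expanded."""
    return len(stroke_sequence(char))


def index_key(char: str) -> Tuple[int, str]:
    """Sort key: stroke count first, then the stroke-type sequence."""
    seq = stroke_sequence(char)
    return (len(seq), "".join(seq))


ordered = sorted(DESCRIPTIONS, key=index_key)
```

Note that in this toy data the entries for U+571F and U+58EB receive the
same key: they share stroke count, order, and types, and differ only in
the relative lengths of the strokes. Distinguishing such forms requires
the coordinate information as well.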

We envision the elements of the CDL as the foundation of a system
for accomplishing the tasks of building, publishing, maintaining, and
using the set of CJKV characters and variants. For this reason, it may be
appropriate to consider encoding a "CJK Unified Ideographic Stroke Type"
block with specific properties, based on modern orthographic standards.

---END---