Nineteenth International Unicode Conference

Typological Encoding of Chinese: Characters, Not Glyphs

Richard Cook - STEDT Project, Linguistics Department, University of California, Berkeley

Intended Audience:	Manager, Software Engineer, Systems Analyst, Marketer
Session Level:	Intermediate

Statement of Purpose:

This presentation examines the relation between the following two points:

The difficulties of holding the Chinese-related (CJKV) scripts answerable to the same principles governing other script encodings.
The question of adequate typologization of CJKV scripts, and its implications for the encoding of other scripts.

Paper Description:

The enormous (even "open-ended") unified CJKV character set, encoding at present more than 71,000 characters, perfectly illustrates the limitations of the "Characters, Not Glyphs" distinction drawn in the Unicode Standard (v. 3.0, e.g. p. 13).

Unicode's CJKV encoding is at present typologized only insofar as it is unified, which is to say that a given encoded character serves to typify an abstract character form which may be realized (as glyph) in a particular script (Chinese, Japanese, Korean or Vietnamese) according to the stylistic traditions of that script.

CJKV typologization could however be taken much further, and in fact arguably should be taken much further, if CJKV scripts were held to the same standards that other scripts (e.g. Arabic, Myanmar, Tibetan ...) have been.

How and why have the CJKV characters not been held to the same standards, and what problems does this present? How might Unicode implementers benefit from componential data being added to the Unihan database, and what might such data look like? How do the Ideographic Description Characters (IDC) and Ideographic Description Sequences (IDS) relate to this?

This paper explores these questions and, in illustrating the componential nature of CJKV characters, demonstrates a method for determining (on the basis of the present CJKV character set) a base, extensible component set for generating and encoding an open-ended number of CJKV characters. It is shown that this framework permits closer adherence to the spirit of the "Characters, Not Glyphs" distinction.
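As a rough illustration of what such componential data might look like, the following Python sketch parses Ideographic Description Sequences in prefix notation and reduces a small set of characters to their base components. It is not the method presented in the paper: the table SAMPLE_IDS and the function names parse_ids, base_components and base_set are hypothetical, and the decompositions are hand-written for illustration rather than taken from the Unihan database. The sketch assumes the twelve IDCs U+2FF0..U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two.

```python
# Ideographic Description Characters U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 are ternary operators; the others are binary.
IDC_ARITY = {
    chr(cp): (3 if cp in (0x2FF2, 0x2FF3) else 2)
    for cp in range(0x2FF0, 0x2FFC)
}

# Hypothetical, hand-written sample data (NOT from the Unihan database).
SAMPLE_IDS = {
    "好": "⿰女子",
    "明": "⿰日月",
    "林": "⿰木木",
    "森": "⿱木林",
    "盟": "⿱明皿",
    "晶": "⿱日⿰日日",
}


def parse_ids(ids, pos=0):
    """Parse one prefix-notation IDS starting at pos.

    Returns (tree, next_pos). A tree is either a single character
    (a leaf) or a tuple (idc, operand, operand[, operand]).
    """
    ch = ids[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos
    operands = []
    for _ in range(IDC_ARITY[ch]):
        node, pos = parse_ids(ids, pos)
        operands.append(node)
    return (ch, *operands), pos


def leaves(tree):
    """Return the leaf characters of a parsed IDS tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1:]:
        out.extend(leaves(child))
    return out


def base_components(char, table):
    """Expand char recursively until no component has a table entry.

    Assumes the table is acyclic, apart from a character that lists
    itself as a component, which is treated as a base component.
    """
    ids = table.get(char)
    if ids is None:
        return {char}
    tree, end = parse_ids(ids)
    if end != len(ids):
        raise ValueError(f"trailing data in IDS for {char!r}")
    result = set()
    for leaf in leaves(tree):
        if leaf == char:
            result.add(leaf)
        else:
            result |= base_components(leaf, table)
    return result


def base_set(table):
    """Union of the base components of every character in the table."""
    result = set()
    for char in table:
        result |= base_components(char, table)
    return result
```

With this sample table, base_set(SAMPLE_IDS) reduces six encoded characters to six base components (女, 子, 日, 月, 木, 皿), and any further character built from those components can be described without enlarging the set. A real component inventory would of course depend on where one chooses to stop decomposing, which is a typological decision and not something this sketch settles.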

It is argued that adequate historical typologization of the Chinese script can only be accomplished with reference to specific texts and inscriptions, and this is what I mean when I refer to a "Text-based" or "Source-based" encoding. These encodings may also be termed "Contextual" in that they seek to document the historical context whence the glyph usage derives. The written sources of ancient Chinese are many and varied, and each offers stylistic peculiarities and mapping challenges. Oracle Bone Inscriptions, Bronze Inscriptions, Stone and Earthenware Inscriptions ... all of these contain vital historical information which only a typological system can address.

A typologization adequate for Chinese purposes would be rather simple in some respects. First, it should characterize aspects of glyph shape, and second, it should characterize aspects of glyph usage. One may however complicate and broaden this scheme in many ways, for example with script-specific issues of componential encoding. It seems that distinctions adequate for the handling of the Chinese script must have implications for the encoding of other scripts.

22 Jun 2001, Webmaster