Re: When is glyph decomposition warranted?

From: Jon Babcock
Date: Sun Aug 29 1999 - 05:57:40 EDT

This subject touches on points, related to the Chinese script, that I
have been mulling over since joining the list a few years ago. Note
that although it commands the majority of code points in the Unicode
standard, Chinese still cannot be fully represented using this
standard. And this is not the omission of a few rare details, but the
inability to represent thirty thousand Chinese graphs that are already
found in the lexicons, plus any newly invented graphs of the future.
Yes, these are the least-used thirty-odd thousand. And with the
Ideographic Description Characters of Unicode 3.0, many of them can
now be described, which is a step forward. But I
wonder if part of the problem in dealing with Chinese has not been
confusion over this question, "When is glyph decomposition warranted?"
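For list members who have not yet looked at the new mechanism: an
Ideographic Description Character such as U+2FF0 is a prefix operator
over the components that follow it. A minimal Python illustration (my
own example, not one drawn from the standard; 林 'forest' is described
as two 木 'tree' side by side):

```python
# U+2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT says:
# the next two components sit side by side.
ids = "\u2FF0\u6728\u6728"   # ⿰ 木 木 -- describes 林 (U+6797)

print(ids)        # the three-character description sequence
print(len(ids))   # 3: a description of the glyph, not one precomposed code point
```

The point is that the sequence describes the glyph's layout; it does
not claim to be the character itself.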

Dean Snyder writes:

> I tentatively suggest then, for a human language encoding scheme such as Unicode (ignoring for the moment the graphic and dingbat symbol areas), that glyph decomposition based upon purely visual criteria is, in general, not useful, whereas glyph decomposition based upon linguistic criteria MAY be useful. And the decision whether to decompose or not will be based both on one's definition of "utility" and on the levels of meaningful discreteness desired in the encoding.

As any student of Chinese calligraphy knows, Chinese glyphs can be,
and indeed are, composed "based upon purely visual criteria". There are
different traditions as to the number of these basic visual elements,
like graphic primitives perhaps, but one well-known tradition has eight,
the 'yongzibafa', i.e., 'the eight model patterns of the graph yong3
(U+6C38)'. With eight strokes (and lots of twisting and turning and a
bit of imagination) you can write any Chinese character. Why, then,
didn't a small set of such strokes form the basis for representing
Chinese in Unicode? Because they are, like the elements

> > / o o o / and o---
> > \ | / \ o

of Dean's Akkadian cuneiform example, or the "Q" and "O" and probably
the "i", "j", ";", and ":" of Markus Kuhn's English examples, "based
upon purely visual criteria" and not "based upon linguistic criteria".
They are certainly part of a glyph's visual history, but not its
linguistic history. (Probably a source even better than traditional
calligraphy for decomposing Chinese glyphs on visual grounds would be
the tradition of woodblock text carvers who, learning how to carve a
limited number of shapes, say, 500, could compose any Chinese glyph.)

Well, for Chinese, what are "the levels of meaningful discreteness
desired in the encoding"? It is possible that the short answer is: the
level of the hemigram.

The problem is that, although the analysis of Chinese glyphs into
hemigrams was demonstrated some 1900 years ago by Xu Shen in his _Shuo
Wen Jie Zi_ dictionary, and many scholars have worked on it over the
centuries, the recent history of Chinese printing points the other
way. After woodblocks became a thing of the past and, ironically,
especially since computers, that history strongly favors the
brute-force method of listing every precomposed glyph one might want
to use in ever-expanding lists, occasionally brought under control by
government fiat in the form of lists of sanctioned graphs and
sanctioned forms of those graphs.
assume this was the situation Unicode faced when starting to deal with
the construction of Unihan. It would seem that the decision whether or
not to unify several variant forms of a single Chinese graph, especially
where these variants appear regularly in different locales, was largely
determined by the need to accommodate the existing CJK character sets
and not solely by a purely disinterested analysis. (Many simplified
forms were included alongside their traditional counterparts, for
example, even though such variants differ only visually, with no
underlying linguistic difference.) As a practical matter, it
probably had to be that way.

For some future version of the Unicode standard, it would be nice if the
big job of hemigramic analysis were carried out so that all the
hemigrams of Chinese that were not already in Unicode could be
included. Then Unicode could be used to indicate any Han character
behind any Han glyph, even newly invented ones. In other words, it could
be used to fully represent the Chinese script.
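If such a hemigram inventory existed, software could at least verify
that a description built from it is well-formed. A sketch of that
check in Python (the arity table follows the IDC definitions in
Unicode 3.0; the function itself is my own illustration, not anything
in the standard):

```python
# The Unicode 3.0 IDCs are prefix operators: most take two operands,
# but U+2FF2 and U+2FF3 (the "three-part" layouts) take three.
IDC_ARITY = {chr(cp): 3 if cp in (0x2FF2, 0x2FF3) else 2
             for cp in range(0x2FF0, 0x2FFC)}

def is_valid_ids(s):
    """Check that a string is exactly one complete description sequence."""
    need = 1                          # expressions still expected
    for ch in s:
        if need == 0:
            return False              # trailing characters after a full IDS
        if ch in IDC_ARITY:
            need += IDC_ARITY[ch] - 1 # operator fills one slot, opens N more
        else:
            need -= 1                 # an ordinary component fills one slot
    return need == 0

print(is_valid_ids("\u2FF0\u6728\u6728"))  # True: ⿰木木 is complete
print(is_valid_ids("\u2FF0\u6728"))        # False: ⿰ still wants an operand
```

This is only a syntactic check, of course; deciding which hemigrams
belong in the inventory is the hard, scholarly part of the job.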

As Peter A. Boodberg wrote 45 years ago,

"The number of graphemes [of the Chinese script] runs from 500 to 800,
estimated on a purely graphic [visual] basis, and to over 2000, if
reckoned on an organic-structural, historical, and phonosemantic basis.
These form in bidimensional combinations a graphicon of some 50,000
graphs or lexigrams (of which only about 10,000 are in common use.)"
_Cedules from a Berkeley Workshop in Asiatic Philology_, 015-541120.
[Out of print.]

I think a case can be made that these 2000 or so Chinese graphemes are
what could be found "useful, both for cultural and computational
reasons" and that future versions of Unicode would benefit by supporting
their use in the decomposition of Chinese glyphs.


Jon Babcock <>

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT