Re: Problems/Issues with CJK and Unicode

From: jon@kanji.com
Date: Sat Apr 08 2000 - 05:07:21 EDT


Thanks to John Jenkins and Kenneth Whistler for clarifying why
things are the way they are in CJK Unicode. My only excuse for
the excessive length of this message is that I rarely post.

I look at CJK Unicode from the outside, at the results as
presented in the Standard, not as an insider who is acquainted
with the whys and wherefores of its current form or as one who
knows what will be included in the next release.

Looking at the Standard V.2, I saw a list of about 21,000 kanji
(hanzi). I noted that this was about 30,000 short of covering
all the kanji entries in, for example, the large Morohashi
dictionary (Daikanwajiten). I noticed that a small but
significant minority of these, perhaps 600 or so, were alternate
forms of the same thing and that the majority of these alternate
forms belonged to the group of simplified forms.

Looking at Standard V. 3, I see the list has increased to about
29,000 kanji. Now it covers over half of the kanji entries in
the large Morohashi, but less than half of other estimates of
the total number of kanji, and hence my statement that Unicode
still does not cover (perhaps I should have said 'does not have
a code point for') the majority of kanji.

Ken said Vertical Extension B (I confess this was the first I'd
heard of it) will add enough kanji to cover all the kanji in
Siku Quanshu (I remember it well ... about fifty lineal feet of
fairly small volumes in the dark recesses of the Far East
Library at the U. of Washington in the early sixties), as well
as the Chinese Encyclopedia (I assume this is the Yongle Dadian,
whose text was originally composed of over 370 million kanji, of
which only a small fraction remains), the Kangxi Dictionary, as
well as a group of recent well-known Chinese dictionaries and
kanji from CNS. So I assume that Vertical Extension B will add
at least another 20,000 kanji or so. At that point Unicode most
certainly will cover nearly all the kanji anyone could
reasonably ask for. Its job of representing Chinese will be
finished. Unless a few thousand more kanji are added.

I thought it interesting that even though so many tens of
thousands of code points are used to cover CJK kanji, and, for
the reasons Ken has stated, some kanji have more than one code
point, there still remain kanji that are not covered. So, in the
case of Chinese, Japanese, Korean, or old Vietnamese, you cannot
say that the entire written script is covered in Unicode in the
way that you can say it of Latin or Greek.

But you could say that if the parts of the kanji were encoded.

> Chinese lexicography works at different levels. The
> lexicographic unit of a dictionary of zi4 (U+5B57) is a zi4,
> but that does not mean that a zi4 equates to a lexeme. The
> lexicographic unit of a dictionary of ci2 (U+8FAD = U+8F9E) is
> a ci2, and that *does* correlate fairly closely with a lexical
> word, or lexeme.

I agree. Let us not try to equate a wen2 or a zi4, by itself, to
a lexeme. It would not apply in the case of binom (ci2)
dictionaries. In the case of zi4 dian3, like the Kangxi Zidian,
we can refer to them as one-graph dictionary entries. You are
right: ci2, binoms (compounds of two or occasionally more
kanji), correspond more closely to English lexemes. But before I
can appreciate what the two-graph (ci2) entries in the Chinese
dictionaries are, I want to understand what the one-graph (wen2
or zi4) entries in the Chinese dictionaries are.

Again, looking at this from the outside, I noticed that what
were being called 'characters' for most non-CJK languages were
units that were used to compose dictionary entries but, on the
contrary, in the case of Chinese, what were being called
'characters' were, in nearly every case, the one-graph
dictionary entries themselves, i.e., the _result_ of a process
of composition, in which two hemigrams were combined to form a
one-graph dictionary entry. This made it easier to understand
why estimates of the 'total number of Chinese characters', from
50,000 to 70,000 or more, could vary so widely. If these graphs
themselves really were the graphemes of the Chinese script, why
couldn't their number be counted more precisely than to within
20,000, or 10,000, or 5,000, or even 50? The answer is that 99% of
these graphs are not the graphemes of Chinese. They are more
like the dictionary entries of English than they are like the
letters of English.

> The "hemigrams" (which, by the way, do not exhaust all the
> pieces you would need to combine to construct all the zi4) have
> little claim to status as graphemes, compared to the zi4 themselves.

On the contrary, the hemigrams do exhaust all the pieces that
are needed to construct all the zi4. I would be interested in
seeing one example of a zi4 that has not been split into
hemigrams by Chinese dictionaries.

What is a zi4? It is a Chinese graph that can be split into
two halves. And what is a wen2? It is a Chinese graph that
cannot be split. On whose authority? Xu Shen (121? AD) and all
Chinese dictionaries since then.

Call these hemigrams what you want; they can combine to form all
the zi4, and there are a limited number of them, probably fewer
than 2,000.
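
As a toy illustration (my own, not data from any dictionary),
here is a short Python sketch of the arithmetic behind that
claim: a small, fixed inventory of hemigrams, combined pairwise,
is enough to spell out a much larger set of one-graph entries.
The splits shown are the conventional ones; the table itself is,
of course, just a sample.

    # A toy sketch (my own illustration, not data from any dictionary):
    # each zi4 is recorded as an ordered pair of hemigrams.
    SPLITS = {
        "\u597D": ("\u5973", "\u5B50"),  # hao3 = woman + child
        "\u5ABD": ("\u5973", "\u99AC"),  # ma1  = woman + horse (phonetic)
        "\u5B57": ("\u5B80", "\u5B50"),  # zi4  = roof  + child
        "\u5B89": ("\u5B80", "\u5973"),  # an1  = roof  + woman
    }

    # Count the distinct pieces actually used.
    hemigrams = {part for pair in SPLITS.values() for part in pair}
    print(len(SPLITS), "zi4 spelled with", len(hemigrams), "hemigrams")
    # Prints: 4 zi4 spelled with 4 hemigrams -- and the ratio only
    # improves as the table grows toward the tens of thousands of zi4.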

Do I think it is/was up to Unicode to figure all this out? No, I
do not. I have changed my mind and now I think that giving each
Chinese graph its own code point is the best way for Unicode,
given the practical reasons presented by Ken W. and John J. My
response was just an example of what the Unicode approach to CJK
stirred up in one fan.

In Version 3 of the Standard, the 214 Kangxi classifiers
have been singled out and given their own new code points in
addition to the ones they already had. And the Ideographic
Description Characters (p. 565) call attention to the various
two-dimensional arrangements that the hemigrams may take. This
is very handy. If this were extended to the phonetic hemigrams
at some point in the future, then you could represent all kanji
with a couple of thousand code points. I found that possibility
attractive.
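
To make that concrete, here is a minimal Python sketch (mine,
not anything in the Standard) of what such a description looks
like. The Ideographic Description Characters U+2FF0 and U+2FF1
and the component code points are real Unicode 3.0 assignments;
the small table and the describe() helper are only hypothetical
illustrations.

    # Two of the twelve Ideographic Description Characters (U+2FF0..U+2FFB).
    IDC_LEFT_TO_RIGHT = "\u2FF0"   # left piece followed by right piece
    IDC_ABOVE_TO_BELOW = "\u2FF1"  # top piece followed by bottom piece

    # Hypothetical table: zi4 -> its Ideographic Description Sequence.
    DESCRIPTIONS = {
        "\u597D": IDC_LEFT_TO_RIGHT + "\u5973" + "\u5B50",   # hao3 = woman + child
        "\u5B57": IDC_ABOVE_TO_BELOW + "\u5B80" + "\u5B50",  # zi4  = roof  + child
    }

    def describe(zi):
        """Return the description sequence for a zi4, or the zi4 itself."""
        return DESCRIPTIONS.get(zi, zi)

    for zi in DESCRIPTIONS:
        ids = describe(zi)
        print(f"U+{ord(zi):04X} -> " + " ".join(f"U+{ord(c):04X}" for c in ids))
    # U+597D -> U+2FF0 U+5973 U+5B50
    # U+5B57 -> U+2FF1 U+5B80 U+5B50

In the Standard these sequences describe the appearance of a
graph rather than encode it; the point of the sketch is only
that a couple of thousand such pieces, plus a dozen arrangement
marks, would in principle be enough to spell out any one-graph
entry.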

Jon

-- 
Jon Babcock <jon@kanji.com>
