RE: Subset of Unicode to represent Japanese Kanji?

From: Marco.Cimarosti@icl.com
Date: Thu Jul 13 2000 - 15:16:43 EDT


Kevin Bracey wrote:
> ----------------------------------------------------
> Useless        Basic Latin only               95
> Limited        [...] + halfwidth katakana    158
> Standard       [...] + JIS X 0208           7037
> Above average  [...] + JIS X 0212          13104

If these memory constraints are really hard, there can be several
intermediate levels between "Limited" and "Standard".

First of all, you can remove from JIS X 0208 all the characters that are not
strictly needed to write Japanese. This includes the Greek and Cyrillic
alphabets, and a handful of dingbats and funny things. My wife would kill me
if I did the counting right now... Let's estimate 4 blocks of 94 characters
each (roughly *376* slots saved).

A much more substantial cut can be achieved by selecting only frequently
used kanji. One good source for a reduced set is "Japan-China-Taiwan
daily-use characters"
(http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/ftp/CJKtable/DailyUse.Z), on
Koichi Yasuoka's CJK page
(http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html). I have counted 1992
kanji in the Japanese column, which corresponds to a net saving of *4366*
characters (the basic 6358 JIS kanji minus Yasuoka's 1992 daily-use kanji).
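The arithmetic behind these two cuts can be sketched in a few lines of
Python. All figures are the ones quoted in this message; combining both
cuts against the "Standard" level is only an assumption about how they
would be applied together, and the names are made up for illustration.

```python
# Figures quoted in this thread (not independently verified).
STANDARD_SIZE = 7037        # "Standard" level in Kevin's table
ROW_SIZE = 94               # characters per JIS X 0208 block
NON_JAPANESE_BLOCKS = 4     # rough estimate: Greek, Cyrillic, dingbats...
JIS_KANJI = 6358            # kanji count used in this message
DAILY_USE_KANJI = 1992      # Japanese column of Yasuoka's table


def non_kanji_saving(blocks=NON_JAPANESE_BLOCKS, row_size=ROW_SIZE):
    """Slots saved by dropping whole blocks not needed for Japanese."""
    return blocks * row_size


def kanji_saving(total=JIS_KANJI, kept=DAILY_USE_KANJI):
    """Slots saved by keeping only a reduced kanji list."""
    return total - kept


def reduced_size(standard=STANDARD_SIZE):
    """Repertoire size if both cuts are applied to the Standard level."""
    return standard - non_kanji_saving() - kanji_saving()
```

With these inputs, non_kanji_saving() gives 376, kanji_saving() gives 4366,
and reduced_size() gives 2295.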

Another good source for similar figures is Jim Breen's celebrated KANJIDIC
(http://www.csse.monash.edu.au/~jwb/kanjidic.html). The [G] field contains
the kanji "grade" (1 to 6 and 9). It is roughly the school year in which a
Japanese kid learns each kanji: 1-6 are the elementary school years, and 9
stands for the junior high school years (7, 8 and 9).
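A minimal sketch of how such per-grade counts could be taken from the file,
assuming the usual KANJIDIC line layout (the kanji, its JIS code, then
space-separated fields such as "G3", with English meanings in {braces} at
the end). The function name is hypothetical, and the file name and EUC-JP
encoding in the usage note are assumptions too.

```python
import re
from collections import Counter

# A grade field is the letter G followed by digits, e.g. "G3".
GRADE_FIELD = re.compile(r"G(\d+)")


def count_grades(lines):
    """Count kanji per grade from KANJIDIC-style text lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Drop the {meanings} so that a word inside them cannot be
        # mistaken for a field.
        fields = line.split("{", 1)[0].split()
        # fields[0] is the kanji, fields[1] its JIS code.
        for field in fields[2:]:
            match = GRADE_FIELD.fullmatch(field)
            if match:
                counts[int(match.group(1))] += 1
                break  # at most one grade per kanji
    return counts
```

It could be used along the lines of
count_grades(open("kanjidic", encoding="euc_jp")), which returns a Counter
keyed by grade number. Kanji without a G field are simply not counted.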

My counts for the seven grades (grade: count, running total) are:

1: 46, 46
2: 105, 151
3: 186, 337
4: 203, 540
5: 193, 733
6: 142, 875
9: 959, 1834
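
The running totals, and the saving each cutoff grade would give against the
6358 kanji figure used in this message, can be recomputed with a short
sketch like this one (function and constant names are made up for
illustration):

```python
from itertools import accumulate

# Per-grade counts as listed in this message.
GRADE_COUNTS = {1: 46, 2: 105, 3: 186, 4: 203, 5: 193, 6: 142, 9: 959}
JIS_KANJI = 6358  # kanji figure used in this message


def running_totals(counts):
    """Map each grade to the number of kanji up to and including it."""
    grades = sorted(counts)
    totals = accumulate(counts[g] for g in grades)
    return dict(zip(grades, totals))


def savings(counts, base=JIS_KANJI):
    """Map each cutoff grade to the kanji slots saved against base."""
    return {g: base - t for g, t in running_totals(counts).items()}
```

For example, running_totals(GRADE_COUNTS)[6] is 875 and
savings(GRADE_COUNTS)[6] is 5483.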

If you pick all the rated characters, up to grade 9, you should have more or
less the same list of daily use characters mentioned above. (I think
Yasuoka's list is more up-to-date, as it probably reflects more recent
reforms in Japan's schooling system).

If you stop at grade 6, you have (if I'm not mistaken) the *Kyouiku Kanzi*
list, the characters taught in elementary school, which is the dream of
every foreign student of Japanese. This would make for a huge saving of
*5483* characters! However, you must be sure that your application can do
with a relatively basic vocabulary and, particularly, that it doesn't need
many proper names (people or places).

You could even consider stopping at grade 2, which is the most basic
literacy level for a Japanese reader. In this case, you would come close to
the size of a single-byte character set. The drawback, of course, is that
your application will write Japanese as well as a 7-year-old kid!

None of these reductions is viable for a general purpose application that
has to handle Japanese text. However, if it is just for the messages issued
by a print head controller, who knows...
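
For a fixed set of messages like that, one way to find out whether a
reduced set is enough is to check every message against it. A minimal
sketch, assuming the messages and the chosen repertoire are both available
as Python strings (the function name is hypothetical):

```python
def missing_characters(messages, repertoire):
    """Return, sorted, the characters used in messages but absent
    from repertoire."""
    allowed = set(repertoire)
    used = set().union(*messages)  # every character in every message
    return sorted(used - allowed)
```

An empty result means every message can be written with the reduced set;
otherwise the result lists the characters that would have to be added.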

_ Marco


