Re: CJK conversion problem

From: John H. Jenkins (
Date: Thu Jul 06 2000 - 17:51:28 EDT

At 12:48 PM -0800 7/6/00, Oliver Steinau wrote:
>I need to write a program to convert sort of all kinds of CJK encodings to
>Unicode (UTF-8, to be precise). I got the Unihan3.0 file from
>and made a quick analysis of the CNS characters in there. I used the
>and kIRG_TSource tags, converted the codes to r/c and printed the result.
>I must admit that I don't understand the results I get:
>There are entries for plane 3, rows > 66 (which, according to Ken Lunde's
>are not defined; plane 3 stops at row 66);

I'll have to double-check this one. If this is the case, then it will
be yet another instance where the IRG was given a version of the
standard which differs slightly from the printed one.
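
For anyone wanting to repeat the check, here is a minimal sketch of the row/cell conversion described in the question. It assumes each source value looks like `3-6321` (a hexadecimal plane digit, a hyphen, then a two-byte hexadecimal code, optionally preceded by `T`); that layout is an assumption and should be verified against the copy of the Unihan file actually in use. The function names are hypothetical.

```python
def cns_to_row_cell(source_value):
    """Split a 'plane-XXXX' value into (plane, row, cell).

    Assumed input layout: hexadecimal plane, '-', four hex digits,
    with an optional leading 'T'. Each byte of the code is expected
    to lie in 0x21..0x7E, as in any 94x94 coded character set.
    """
    text = source_value.strip().upper()
    if text.startswith("T"):
        text = text[1:]
    plane_text, code_text = text.split("-")
    plane = int(plane_text, 16)
    code = int(code_text, 16)
    high, low = code >> 8, code & 0xFF
    if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
        raise ValueError("not a 94x94 code: " + source_value)
    # Row and cell are the byte values minus 0x20, giving 1..94.
    return plane, high - 0x20, low - 0x20


def rows_beyond(source_values, plane, last_row):
    """Return the values in the given plane whose row exceeds last_row."""
    found = set()
    for value in source_values:
        p, row, _cell = cns_to_row_cell(value)
        if p == plane and row > last_row:
            found.add(value)
    return sorted(found)
```

Calling `rows_beyond(values, plane=3, last_row=66)` on the collected source values lists the entries that fall past row 66 of plane 3, which makes it easier to compare them against the printed standard.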

>OTOH, I found quite some
>missing from plane 4, and almost all from planes 5, 6, 7, and 15.
>My questions are: Why are so many characters missing?

Two main reasons:

1) When Taiwan was proposing characters from CNS 11643-1992 to add to
Unicode as part of the Vertical Extension A, they guessed (rightly)
that they would end up in the room vacated by obsolete Korean hangul
from U+3400 to U+4DFF, and they did some triage. They determined
which characters were most important and only proposed those.

2) A lot of the characters in CNS 11643-1992 are unifiable variants
under Unicode's rules and so could not be included as distinct
ideographs in Unihan.

These problems are currently being worked on. The remaining
non-unifiable ideographs from CNS 11643-1992 are a part of Vertical
Extension B, which will likely be an official part of Unicode within
a year. The remaining unifiable ideographs will likely be added to
the standard as a set of new compatibility ideographs (also in Plane

>And: what am I supposed to do if I encounter a text that uses these

Well, you can temporarily assign them to positions in Unicode's
Private Use Area. There isn't much more you can do until official
mappings are
available. Given the amount of tweaking that ideograph proposals get
at the last minute, it is *really* unwise to use their provisional
code points.
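
As a rough illustration of the temporary assignment, the sketch below hands out private-use code points on demand and remembers them, so the same unmapped character always converts to the same position. The class and function names are hypothetical, the `official` table stands for whatever CNS-to-Unicode mapping you have built, and the default range U+E000..U+F8FF is the private-use block of the BMP.

```python
class PrivateUseAllocator:
    """Assign stable private-use code points to unmapped source codes."""

    def __init__(self, first=0xE000, last=0xF8FF):
        self._next = first
        self._last = last
        self._assigned = {}

    def code_point_for(self, key):
        # Reuse an earlier assignment so the conversion stays consistent.
        if key not in self._assigned:
            if self._next > self._last:
                raise OverflowError("private-use range exhausted")
            self._assigned[key] = self._next
            self._next += 1
        return self._assigned[key]

    def table(self):
        """Return a copy of the assignments made so far."""
        return dict(self._assigned)


def convert(keys, official, allocator):
    """Convert (plane, row, cell) keys to UTF-8 bytes.

    Keys found in `official` use their mapped code point; all others
    receive a temporary private-use code point from `allocator`.
    """
    chars = []
    for key in keys:
        code_point = official.get(key)
        if code_point is None:
            code_point = allocator.code_point_for(key)
        chars.append(chr(code_point))
    return "".join(chars).encode("utf-8")
```

Saving the result of `table()` alongside the converted text lets the private-use characters be remapped once official code points are published.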

John H. Jenkins
