possible corrections to the unihan.txt database

From: Doug Schiffer (laotzuREMOVE_THIS@dreamscape.com)
Date: Sun Feb 14 1999 - 13:46:33 EST


As part of my CCCII project, I've been cross-checking the CNS<->CCCII
information in Christian Wittern's Kanjibase
(http://www.oas.hist.uni-goettingen.de/kbwww/kbQuery.htm) against a
similar table derived from the Unihan.txt database maintained on the
unicode.org site
(ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/UNIHAN.TXT). I've been
walking down this table in CNS92 order, and the following lists are from
CNS plane 1. I will shortly have similar lists for CNS planes 2 and 3.
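
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch of the cross-check. It is not the procedure actually used for the
lists in this message; the function name and the dictionary layout
(Unicode hex string mapped to CCCII hex string) are assumptions made for
illustration, and the sample rows are copied from the lists below.

```python
def cross_check(proposed, unihan):
    """Compare two {unicode_hex: cccii_hex} tables (hypothetical layout).

    Returns (missing, mismatched):
      missing    - code points that have a CCCII value in `proposed`
                   but no entry at all in `unihan`
      mismatched - code points present in both tables with different values,
                   stored as (proposed_value, unihan_value)
    """
    missing = {}
    mismatched = {}
    for ucs, cccii in sorted(proposed.items()):
        other = unihan.get(ucs)
        if other is None:
            missing[ucs] = cccii
        elif other.upper() != cccii.upper():
            mismatched[ucs] = (cccii, other)
    return missing, mismatched


# Sample rows taken from the lists in this message.
proposed = {"503C": "21317C", "4F0D": "21307B", "52E6": "21343B"}
unihan = {"4F0D": "393054", "52E6": "33337B"}

missing, mismatched = cross_check(proposed, unihan)
```

A mismatch reported this way only says that the two tables disagree. It
does not say which one is right, so each row still has to be checked by
eye against the fonts.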

I would appreciate any input that helps improve the accuracy of both
databases, and with it how well we can match up these diverse character
sets. It's quite possible that the CNS or CCCII font files I have been
using contain errors that would throw off my analysis of any
discrepancies. For my work, I've been using the font files from the
following sites:

CNS: http://www.ifcss.org/ftp-pub/software/fonts/cns/hbf/
CCCII: ftp://nctuccca.edu.tw/Chinese/CCDB/

The following codepoints were missing CCCII entries in the Unihan.txt
database, but appeared to have valid CCCII matches:

Unicode CNS92  CCCII
======= ====== ======
503C    1-542B 21317C
524E    1-5026 213367
555F    1-5A76 21424F
568F    1-734A 213753
59EC    1-5478 213978
5F37    1-5A30 213D48
5F5E    1-7641 242D37
6085    1-5550 213E4B
6735    1-4838 214370
7946    1-5269 214E63
7DD2    1-6A45 215155
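
A list like the one above could be produced by scanning the Unihan file
for codepoints that carry a CNS plane 1 value but no CCCII value. The
sketch below assumes tab-separated lines of the form
"U+XXXX<TAB>tag<TAB>value" with the field tags kCNS1992 and kCCCII, as
found in current Unihan releases; whether the 1999 file matches this
layout exactly is an assumption. The function names are hypothetical and
the sample lines are built from rows in this message.

```python
def parse_unihan(lines):
    """Collect fields from Unihan-style lines into {code_point: {tag: value}}."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip anything that is not "U+XXXX<TAB>tag<TAB>value"
        ucs, tag, value = parts
        if ucs.startswith("U+"):
            ucs = ucs[2:]
        table.setdefault(ucs, {})[tag] = value
    return table


def missing_cccii(table, plane="1"):
    """Code points with a CNS value in the given plane but no kCCCII field."""
    prefix = plane + "-"
    return sorted(
        ucs
        for ucs, fields in table.items()
        if fields.get("kCNS1992", "").startswith(prefix) and "kCCCII" not in fields
    )


# Sample lines constructed from rows in this message (assumed file layout).
sample = [
    "U+503C\tkCNS1992\t1-542B",
    "U+4F0D\tkCNS1992\t1-4730",
    "U+4F0D\tkCCCII\t393054",
]

candidates = missing_cccii(parse_unihan(sample))
```

The result is only a list of candidates. Each one still needs a CCCII
match to be found by hand, for example from Kanjibase or the CCCII font.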

For the following codepoints, I decided to use the plane 1 equivalent
rather than the higher-plane codepoint that was listed in the Unihan.txt
file:

Unicode CNS92  Plane 1 Unihan.txt
               Value   Value
======= ====== ======= ==========
52E6    1-633A 21343B  33337B
58F9    1-5E62 213875  333021
5FA0    1-5A3A 222A36  33314C
634D    1-5564 223150  2E2F7C
6AC2    1-7653 214567  39447D
6C3E    1-4666 21465F  394735
6C61    1-4844 244A54  334665
6D38    1-522C 224854  2E4D3D
6EF7    1-6958 224C63  2F5D3C
7030    1-7975 21493E  333D4C
7032    1-7976 245034  2E4E41
7156    1-6524 22526D  334342
7609    1-696E 214C64  393E7D
7652    1-766B 214C72  333E7D
7A1C    1-655E 214F38  2E3D73
7BE0    1-745B 226D34  2E6C26
7D43    1-5C52 215127  333D42
85A6    1-7535 23262B  395477
8907    1-6E6F 215770  393D6F
8A3C    1-6225 215850  39593F
8CB3    1-6233 215974  333051
8FF4    1-584F 215B79  333768
8FFA    1-584E 215B78  33303A
9452    1-7B7D 215E41  335E42
965E    1-5868 234A4D  39345B
9B28    1-7334 216169  33362A
             

The following codepoints seem to have incorrect values in the CCCII
field. This could also be caused by my CCCII font file being incorrect.

Unicode CNS92  Correct Unihan.txt
                       Value
======= ====== ======= ==========
4F0D    1-4730 21307B  393054
696E    1-5F7D 223D72  223E69
7BB8    1-6A2F 226C39  226C59


