L2/03-311
Date/Time: Fri Sep 12 04:28:20 EDT 2003
Contact: andrewcwest@alumni.princeton.edu
Report Type: Public Review Issue
RE: Unicode 4.0.1 Beta Review

Please find below my list of errata for the kRSUnicode and kMandarin fields of the Unihan database ("Unihan-4.0.1d1.txt").

1. kRSUnicode Field

In general the kRSUnicode data is fairly reliable. My only major criticism is that the method of counting strokes is not consistent for all CJK ideographs. Many ideographs have an "old" and a "new" form with different stroke counts, and the choice of old or new form for counting strokes has not been applied consistently. The most consistent treatment would be to count the strokes of the Unicode reference glyph, but in many cases the stroke count appears to be for a glyph form that is different from the reference glyph. Thus, for example, where U+7232 is used as a phonetic element, in most cases it is counted as 12 strokes (i.e. as the "old" form of the character), but in a couple of cases [U+5645, U+87E1] it is counted as 9 strokes (i.e. as if it were the "new" form of the character, U+70BA, even though the reference glyph uses the "old" form).

A. Missing Simplified Radical Marker

For the basic CJK block, the simplified form of a radical is indicated by an apostrophe after the radical number. The apostrophe is omitted for some basic CJK ideographs, and for all ideographs in the CJK-A and CJK-B blocks. It would be very useful if the apostrophe were applied to all simplified radicals in all CJK blocks. At present I have to add it manually where it is missing as a prelude to automatic processing of the Unihan kRSUnicode data.

U+4336..4341 Radical 120 should be 120'
U+4723..4729 Radical 149 should be 149'
U+478C..4790 Radical 154 should be 154'
U+4880..4882 Radical 159 should be 159'
U+497A..4986 Radical 167 should be 167'
U+49B6..49B8 Radical 169 should be 169'
U+4B6A Radical 184 should be 184'
U+4BC3..4BC5 Radical 187 should be 187'
U+4C9D..4CA4 Radical 195 should be 195'
U+4D13..4D19 Radical 196 should be 196'
U+4DAD..4DAE Radical 212 should be 212'
U+8D5C 154.11 should be 154'.11
U+8D5D 154.12 should be 154'.12
U+8F89 159.8 should be 159'.8
U+987C 181.4 should be 181'.4
U+9E6B 196.12 should be 196'.12
U+9EA6 199.0 should be 199'.0
U+9EFE 205.0 should be 205'.0
U+9F7F 211.0 should be 211'.0
U+26208..26221 Radical 120 should be 120'
U+27BAA Radical 149 should be 149'
U+27E51..27E57 Radical 154 should be 154'
U+28405..2840A Radical 159 should be 159'
U+28C3E..28C56 Radical 167 should be 167'
U+28DFF..28E0E Radical 169 should be 169'
U+293FC..29400 Radical 178 should be 178'
U+29595..29597 Radical 181 should be 181'
U+29665..29670 Radical 182 should be 182'
U+297FE..2980F Radical 184 should be 184'
U+299E6..29A10 Radical 187 should be 187'
U+29F79..29F8E Radical 195 should be 195'
U+2A241..2A255 Radical 196 should be 196'
U+2A388..2A390 Radical 199 should be 199'
U+2A68F..2A690 Radical 211 should be 211'
U+2FA18 Radical 205 should be 205'

B. Wrong Stroke Count

These are just the obvious errors. There are very many "out by one or two" errors by my reckoning, but these are mostly due to the fact that some characters can be written in different ways, and there are different methods of counting strokes.

U+5645 kRSUnicode 30.9 should be 30.12
U+62B2 kRSUnicode 64.11 should be 64.5
U+7777 kRSUnicode 109.19 should be 109.9
U+8624 kRSUnicode 140.14 should be 140.17
U+87E1 kRSUnicode 142.9 should be 142.12
U+9484 kRSUnicode 167.17 should be 167.21
U+9ED9 kRSUnicode 203.11 should be 203.4
U+2005D kRSUnicode 1.15 should be 1.16 (Unihan 3.2 has 1.16 which is correct by my reckoning)

C. Wrong Radical

The first case is probably a simple typo; the other two are misclassifications. There may well be other cases of misclassified radicals, but these are the only ones I have really noticed.

U+4CED kRSUnicode 199.9 should be 196.9 (also stroke count is not quite right by my reckoning)
U+5655 kRSUnicode 87.11 should be 30.12
U+81A4 kRSUnicode 74.11 should be 130.11 (definition is "eggs of birds or reptiles; testicles")

D. Inconsistent Radical for Traditional/Simplified Equivalents

These are cases where simplified and traditional forms of the same character are assigned different radicals, even though they are structurally identical.

U+8206 kRSUnicode 134.10 but U+8F3F kRSUnicode 159.10 (134.10 should be 159'.10)
U+89C6 kRSUnicode 113.4 but U+8996 kRSUnicode 147.4 (113.4 should be 147'.4)

2. kMandarin Field

I am pleased to note that the problem of misplaced readings in Unihan 3.2 has now been rectified, and that most of the invalid Mandarin pinyin spellings have been corrected. However, the readings appear to derive from a variety of sources of varying degrees of reliability, with the result that many of the kMandarin readings are distinctly questionable. The kMandarin field would be one of the most useful fields in the Unihan database if the data could be relied on as accurate, but unfortunately it cannot be relied on at the present time. As a long-term goal I would suggest that the current kMandarin data be scrapped, and completely new data derived [as far as possible] from a single, authoritative source such as the _Hanyu Da Zidian_.

A. Invalid Representation of U-UMLAUT

I am aware that this is noted as a known limitation in the Unihan file, but here is a list of all the affected codepoints (all corresponding CJK-A characters). It should be quite simple to clear these up, and it would save having to preprocess the kMandarin data.

U+347C kMandarin LYUE4 should be LÜE4
U+3500 kMandarin LVE4 QING2 should be LÜE4 QING2
U+3527 kMandarin LYU4 XUE4 should be LÜ4 XUE4
U+35C9 kMandarin BI4 E4 LVE4 should be BI4 E4 LÜE4
U+3825 kMandarin LV4 should be LÜ4
U+385E kMandarin LEI3 LOU2 LYU3 should be LEI3 LOU2 LÜ3
U+38B3 kMandarin LU2 LYU4 should be LU2 LÜ4
U+3B5A kMandarin LYU3 should be LÜ3
U+3B9F kMandarin JI2 NIAN3 NYU4 PENG4 ROU4 should be JI2 NIAN3 NÜ4 PENG4 ROU4
U+3CB6 kMandarin LV4 should be LÜ4
U+3D56 kMandarin NYU4 should be NÜ4
U+3EF2 kMandarin LYU3 should be LÜ3
U+3F94 kMandarin LYU3 should be LÜ3
U+40AE kMandarin LYUE4 should be LÜE4
U+430E kMandarin LYUE4 should be LÜE4
U+451E kMandarin LYU4 should be LÜ4
U+4561 kMandarin LU3 LYU2 should be LU3 LÜ2
U+45A1 kMandarin NYU4 should be NÜ4
U+4610 kMandarin NYU4 should be NÜ4
U+46BC kMandarin NYU4 should be NÜ4
U+46DA kMandarin LYUE4 should be LÜE4
U+474F kMandarin LOU2 LOU3 LYU2 should be LOU2 LOU3 LÜ2
U+4896 kMandarin LYU4 YU4 should be LÜ4 YU4
U+48DA kMandarin LOU2 LYU2 should be LOU2 LÜ2
U+491A kMandarin LI3 LVE4 should be LI3 LÜE4
U+4923 kMandarin LYUE4 should be LÜE4
U+4968 kMandarin LYU4 should be LÜ4
U+4A0B kMandarin NYUE4 should be NÜE4
U+4B89 kMandarin LU2 LYU4 should be LU2 LÜ4
U+4BAB kMandarin LOU2 LYU2 should be LOU2 LÜ2
U+4D8A kMandarin NYU4 should be NÜ4

B. Invalid Application of U-UMLAUT

LÜN and LÜAN are invalid Mandarin pinyin spellings.

U+6523 kMandarin LÜAN2 LUAN2 should be LUAN2
U+7674 kMandarin LÜAN2 should be LUAN2
U+7D6F kMandarin LÜN4 GAI1 should be LUN4 GAI1

C. Invalid Non-Application of U-UMLAUT

LUE and NUE should always be written with an umlaut.

U+3A3C kMandarin LUE4 should be LÜE4
U+4588 kMandarin NUE4 should be NÜE4
U+458B kMandarin NUE4 should be NÜE4
U+63A0 kMandarin LÜE4 LUE3 should be LÜE4 LÜE3
U+7878 kMandarin NUE4 should be NÜE4

D. Other Invalid Pinyin Spellings

IE, AU, LIONG and IONG are invalid Mandarin pinyin spellings.

U+34C8 kMandarin BEI4 BING4 FEI4 IE4 should be BEI4 BING4 FEI4 YE4 ?
U+3729 kMandarin AU4 BIE2 should be AO4 BIE2
U+3D88 kMandarin LIONG3 YING2 should be YONG3 YING2 ?
U+4071 kMandarin AU4 should be AO4
U+4D08 kMandarin AU3 should be AO3
U+7867 kMandarin IONG3 should be YONG3

E. Duplicate Readings

U+3561 kMandarin HE2 HE2 HE4 HE4 HUO4
U+3563 kMandarin YAN3 YAN4 YAN4
U+356A kMandarin DAN3 DAN3
U+3613 kMandarin LAN2 LAN2
U+363A kMandarin FA2 FA2
U+369C kMandarin XU4 XU4 YU4
U+38BD kMandarin ER3 ER3
U+39A9 kMandarin YIN3 YIN3
U+39E5 kMandarin XIAN3 XIAN3
U+3C34 kMandarin PO2 POU3 POU3
U+3C76 kMandarin BENG4 JIAO4 PENG2 PENG2 QIAO3 RU4
U+3C78 kMandarin BI4 BI4 BIE2
U+3E2E kMandarin FEN2 FEN2
U+3F18 kMandarin WA3 WA3
U+3FC5 kMandarin XIAN3 XUAN3
U+400E kMandarin MIAN3 MIAN3
U+4014 kMandarin NIU2 REN4 REN4
U+414D kMandarin DONG4 DONG4 TING3
U+4192 kMandarin JIU4 JIU4
U+42BC kMandarin CHI3 CHI3
U+431F kMandarin BI4 BI4
U+443D kMandarin MAN2 MAN2
U+46AC kMandarin LIN2 LIN2
U+47A2 kMandarin ZHA4 ZUO2 ZUO2
U+47F8 kMandarin KUI2 KUI2
U+4902 kMandarin MEI2 MEI2
U+4913 kMandarin MENG2 MENG2
U+49DD kMandarin DI4 YI2 ZHI4 ZHI4
U+4A8A kMandarin LONG2 LONG2
U+4B6A kMandarin LIANG2 LIANG2
U+4C95 kMandarin PU3 PU3 PU4
U+4DA7 kMandarin YAO3 YAO3
U+4E0A kMandarin SHANG4 SHANG4
U+4E21 kMandarin LIANG3 LIANG3 LIANG4
U+4E24 kMandarin LIANG3 LIANG3 LIANG4
U+4E4E kMandarin HU1 HU2 HU1
U+4E86 kMandarin LIAO3 LIAO3
U+4EC0 kMandarin SHI2 SHEN2 SHI2 SHI2
U+4EF0 kMandarin YANG3 YANG3 YANG4 ANG2 YANG4 ANG2
U+4F5D kMandarin KOU4 GOU1 KOU4
U+4F63 kMandarin YONG4 YONG1 YONG2 YONG2
U+4F7F kMandarin SHI3 SHI3 SHI4
U+4F97 kMandarin TONG2 TONG3 DONG4 DONG4 TONG3
U+4FF3 kMandarin PAI2 PAI2
U+5012 kMandarin DAO3 DAO4 DAO4
U+5169 kMandarin LIANG3 LIANG3 LIANG4
U+5207 kMandarin QIE1 QIE4 QIE4 QI4 QI4 QI4 QI4
U+5212 kMandarin HUA2 HUA4 HUA2
U+524A kMandarin XUE1 XIAO1 XUE1
U+525D kMandarin BO1 BAO1 BO1
U+52B3 kMandarin LAO2 LAO2 LAO4
U+52D8 kMandarin KAN1 KAN4 KAN1
U+52DE kMandarin LAO2 LAO2 LAO4
U+5395 kMandarin CE4 CE4
U+53A0 kMandarin CE4 CE4
U+53C2 kMandarin CAN1 SAN1 SHEN1 DEN1 CEN1 SHEN1 SAN1 SAN3
U+53C3 kMandarin CAN1 SAN1 SHEN1 DEN1 CEN1 SHEN1 SAN1 SAN1
U+53CD kMandarin FAN3 FAN3 FAN1
U+53F6 kMandarin XIE2 YE4 YE4 SHE4
U+5403 kMandarin CHI1 CHI1 JI1
U+5408 kMandarin HE2 HE2 GE3
U+5410 kMandarin TU3 TU4 TU3
U+5439 kMandarin CHUI1 CHUI1 CHUI4
U+54D1 kMandarin YA3 YA1 YA3 E4
U+5527 kMandarin JI1 JI1
U+555E kMandarin YA3 YA1 YA3 E4
U+5587 kMandarin LA3 LA3
U+559E kMandarin JI1 JI1
U+55AB kMandarin CHI1 CHI1 JI1
U+561B kMandarin MA5 MA5
U+5632 kMandarin CHAO2 ZHAO1 ZHAO1
U+5730 kMandarin DI4 DE5 DI4
U+575E kMandarin WU4 WU4
U+5795 kMandarin HOU4 HOU4
U+57AB kMandarin DIAN4 DIAN4
U+57E0 kMandarin BU4 BU4
U+57FD kMandarin SAO4 SAO3 SAO3
U+5862 kMandarin WU4 WU4
U+588A kMandarin DIAN4 DIAN4
U+58A9 kMandarin DUN1 DUN1
U+58AA kMandarin DUN1 DUN1
U+58AC kMandarin DI4 DE5 DI4
U+590D kMandarin FU4 FU4
U+59E8 kMandarin YI2 YI2
U+5B32 kMandarin NIAO3 NIAO3
U+5B81 kMandarin ZHU4 NING2 NING4 NING2
U+5BD5 kMandarin NING2 NING4 NING2
U+5BE7 kMandarin ZHU4 NING2 NING4 NING2
U+5C0A kMandarin ZUN1 ZUN1
U+5C7F kMandarin YU3 XU4 YU3
U+5DBC kMandarin YU3 XU4 YU3
U+5E05 kMandarin SHUAI4 SHUAI4 SHUO4
U+5E25 kMandarin SHUAI4 SHUAI4 SHUO4
U+5EC1 kMandarin CE4 CE4
U+5F04 kMandarin NONG4 LONG4 LONG4
U+5F0E kMandarin SAN1 SAN1 SAN4
U+5F1F kMandarin DI4 TI4 TI4
U+5F4A kMandarin QIANG2 QIANG3 JIANG1 JIANG1 JIANG4
U+5F87 kMandarin XUN4 XUN2 XUN4
U+5F95 kMandarin LAI2 LAI2 LAI4
U+5FA0 kMandarin LAI2 LAI2 LAI4
U+5FA9 kMandarin FU4 FU4 FOU4
U+5FF8 kMandarin NIU3 NIU3 NÜ4
U+608C kMandarin TI4 TI4
U+60DD kMandarin TANG3 CHANG3 CHANG3
U+6310 kMandarin RU2 NA2 NU2 RAO2 RU2
U+6837 kMandarin YANG2 YANG4 XIANG4 YANG2
U+6BD3 kMandarin YU4 YU4
U+6C59 kMandarin WU1 WU4 WA1 WU4 YU1
U+7089 kMandarin LU2 LU2
U+762A kMandarin PIE1 BIE3 BIE1 BIE3
U+7A79 kMandarin QIONG1 QIONG1 KONG1 QIONG2
U+7F5B kMandarin GU1 GU1
U+7FA1 kMandarin XIAN4 YI2 XIAN4 YAN2 YI2
U+7FA8 kMandarin XIAN4 YI2 YAN2 YI2
U+82A6 kMandarin LU2 LU3 LU2
U+8449 kMandarin XIE2 YE4 YE4 SHE4
U+8897 kMandarin ZHEN3 ZHEN1 ZHEN3
U+8910 kMandarin HE2 HE4 HE2
U+8E42 kMandarin ROU2 ROU2
U+8EAA kMandarin LIN4 LIN4
U+8EE4 kMandarin HU1 HU1
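
The kMandarin checks described in sections 2.A to 2.E could be automated along the following lines. This is only a minimal sketch, not the procedure actually used to compile the lists above. The function names check_reading_list and check_line are hypothetical, and the code assumes that each Unihan line has the tab-separated form "U+XXXX<TAB>kMandarin<TAB>READING READING ...". Stand-ins for U-umlaut (2.A) and missing umlauts (2.C) are rewritten automatically; the invalid spellings of 2.B and 2.D are only flagged, since the right correction is not always certain; duplicate readings (2.E) are dropped while keeping the original order.

```python
import re

# Rewrites for U-umlaut problems, applied to one reading at a time.
UMLAUT_FIXES = [
    # 2.A: LYU / LV / NYU / NV (optionally followed by E) stand in for LÜ / NÜ
    (re.compile(r"^([LN])(?:YU|V)(E?\d)$"), r"\1Ü\2"),
    # 2.C: LUE / NUE should always carry the umlaut
    (re.compile(r"^([LN])UE(\d)$"), r"\1ÜE\2"),
]

# 2.B and 2.D: spellings that are not valid Mandarin pinyin (flagged only).
INVALID = re.compile(r"^(?:LÜAN|LÜN|IE|AU|LIONG|IONG)\d$")


def check_reading_list(readings):
    """Return (cleaned_readings, problems) for one kMandarin value."""
    cleaned = []
    problems = []
    for reading in readings.split():
        fixed = reading
        for pattern, replacement in UMLAUT_FIXES:
            fixed = pattern.sub(replacement, fixed)
        if fixed != reading:
            problems.append(f"umlaut: {reading} -> {fixed}")
        if INVALID.match(fixed):
            problems.append(f"invalid spelling: {fixed}")
        if fixed in cleaned:
            # 2.E: report the duplicate and keep only the first occurrence
            problems.append(f"duplicate: {fixed}")
        else:
            cleaned.append(fixed)
    return cleaned, problems


def check_line(line):
    """Check one Unihan line; return None if it is not a kMandarin entry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or fields[1] != "kMandarin":
        return None
    codepoint, _, value = fields
    cleaned, problems = check_reading_list(value)
    return codepoint, " ".join(cleaned), problems


if __name__ == "__main__":
    samples = [
        "U+3500\tkMandarin\tLVE4 QING2",
        "U+3561\tkMandarin\tHE2 HE2 HE4 HE4 HUO4",
        "U+3729\tkMandarin\tAU4 BIE2",
    ]
    for sample in samples:
        print(check_line(sample))
```

Flagging rather than rewriting the 2.B and 2.D cases is deliberate: entries such as U+34C8 and U+3D88 are marked with "?" above, so a human decision is still needed for them.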