Re: Unicode 4.0.1 Released

From: Andrew C. West (
Date: Fri Apr 02 2004 - 06:38:37 EST

    On Tue, 30 Mar 2004 15:49:53 -0800, Rick McGowan wrote:
    > Unicode 4.0.1 has been released!
    > The main new features in Unicode 4.0.1 are the following:
    > 1. The first significant update of the Unihan Database (Unihan.txt)
    > since Unicode 3.2.0, including a large number of fixes and
    > additional data items.

    For me, 4.0.1 was a big disappointment. The much-vaunted update of the Unihan
    database did not even clear up all the editorial errors in the database, let
    alone deal with the real problems of content, such as incorrect or dubious
    Mandarin, Cantonese, Korean and Japanese readings.

    As to the 164 incorrect Vietnamese readings for basic CJKV ideographs, I notice
    that although the correct readings for these characters have now been added to
    the kVietnamese field, the original erroneous readings (agreed as such by the
    relevant Vietnamese experts) have been retained as well. As a result, each of
    the 164 characters in the CJK Unified Ideographs block with a kVietnamese key
    now has a spurious reading followed by the correct reading. Hardly much of an
    improvement.

    For the record, the following is a list of easily fixed editorial errors
    relating to the fields of interest to me that I submitted as part of the review
    process, and which remain unfixed in the latest version of the Unihan database.
    This means that I have to manually preprocess Unihan.txt to correct these errors
    before I can put it through my parsing program -- which is a pain.
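
    A minimal sketch of that kind of preprocessing pass, assuming the usual
    Unihan.txt layout of one tab-separated "U+XXXX, field, value" triple per line.
    The function name preprocess and the OVERRIDES table are hypothetical; the two
    sample corrections are copied from section 2.C of the list below.

```python
# Corrections keyed by (code point, field). Hypothetical table; only two
# entries from section 2.C are shown.
OVERRIDES = {
    ("U+66D5", "kMandarin"): "YAN4",
    ("U+7867", "kMandarin"): "YONG3",
}

def preprocess(lines, overrides=OVERRIDES):
    """Yield Unihan.txt lines, substituting any value listed in overrides."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Comment lines and malformed lines do not split into three fields
        # and are passed through unchanged.
        if len(fields) == 3 and (fields[0], fields[1]) in overrides:
            fields[2] = overrides[(fields[0], fields[1])]
            line = "\t".join(fields) + "\n"
        yield line
```

    Because preprocess is a generator over lines, it can sit between an open file
    and an existing parser without loading the whole database into memory.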

    1. kRSUnicode Field

    A. Missing Simplified Radical Marker
    Simplified radicals are indicated by an apostrophe after the radical number in
    the basic CJK block, but not in CJK-A or CJK-B. A few entries in the basic
    block are also missing the marker.
    U+4336..4341 Radical 120 should be 120'
    U+4723..4729 Radical 149 should be 149'
    U+478C..4790 Radical 154 should be 154'
    U+4880..4882 Radical 159 should be 159'
    U+497A..4986 Radical 167 should be 167'
    U+49B6..49B8 Radical 169 should be 169'
    U+4B6A Radical 184 should be 184'
    U+4BC3..4BC5 Radical 187 should be 187'
    U+4C9D..4CA4 Radical 195 should be 195'
    U+4D13..4D19 Radical 196 should be 196'
    U+4DAD..4DAE Radical 212 should be 212'
    U+8D5C 154.11 should be 154'.11
    U+8D5D 154.12 should be 154'.12
    U+8F89 159.8 should be 159'.8
    U+987C 181.4 should be 181'.4
    U+9E6B 196.12 should be 196'.12
    U+9EA6 199.0 should be 199'.0
    U+9EFE 205.0 should be 205'.0
    U+9F7F 211.0 should be 211'.0
    U+26208..26221 Radical 120 should be 120'
    U+27BAA Radical 149 should be 149'
    U+27E51..27E57 Radical 154 should be 154'
    U+28405..2840A Radical 159 should be 159'
    U+28C3E..28C56 Radical 167 should be 167'
    U+28DFF..28E0E Radical 169 should be 169'
    U+293FC..29400 Radical 178 should be 178'
    U+29595..29597 Radical 181 should be 181'
    U+29665..29670 Radical 182 should be 182'
    U+297FE..2980F Radical 184 should be 184'
    U+299E6..29A10 Radical 187 should be 187'
    U+29F79..29F8E Radical 195 should be 195'
    U+2A241..2A255 Radical 196 should be 196'
    U+2A388..2A390 Radical 199 should be 199'
    U+2A68F..2A690 Radical 211 should be 211'
    U+2FA18 Radical 205 should be 205'
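
    The corrections in this list are mechanical enough to apply with a small
    table. The following is only a sketch: mark_simplified and SIMPLIFIED_RANGES
    are hypothetical names, only three rows of the list are transcribed, and the
    tab-separated Unihan.txt line layout is assumed.

```python
import re

# (first, last, radical) triples; only a few rows of the list are shown.
SIMPLIFIED_RANGES = [
    (0x4336, 0x4341, 120),
    (0x4723, 0x4729, 149),
    (0x8D5C, 0x8D5C, 154),
]

def mark_simplified(line):
    """Return the line with the apostrophe added if its code point is listed."""
    body = line.rstrip("\n")
    ending = line[len(body):]
    fields = body.split("\t")
    if len(fields) != 3 or fields[1] != "kRSUnicode":
        return line
    cp = int(fields[0][2:], 16)  # "U+8D5C" -> 0x8D5C
    for first, last, radical in SIMPLIFIED_RANGES:
        if first <= cp <= last:
            # Rewrite "154.11" as "154'.11"; "154'.11" is left untouched
            # because the pattern requires the dot right after the number.
            fields[2] = re.sub(r"\b%d\.(-?\d+)" % radical,
                               r"%d'.\1" % radical, fields[2])
            break
    return "\t".join(fields) + ending
```

    Restricting the substitution to the listed radical number means a second
    radical-stroke value on the same line, if present under a different radical,
    is not altered.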

    2. kMandarin Field

    A. Invalid Application of U-UMLAUT
    LÜN and LÜAN are invalid Mandarin pinyin spellings.
    U+6523 kMandarin LÜAN2 LUAN2 should be LUAN2
    U+7674 kMandarin LÜAN2 should be LUAN2
    U+7D6F kMandarin LÜN4 GAI1 should be LUN4 GAI1

    B. Invalid Non-Application of U-UMLAUT
    LUE and NUE should always be written with an umlaut.
    U+3A3C kMandarin LUE4 should be LÜE4
    U+4588 kMandarin NUE4 should be NÜE4
    U+458B kMandarin NUE4 should be NÜE4
    U+63A0 kMandarin LÜE4 LUE3 should be LÜE4 LÜE3
    U+7878 kMandarin NUE4 should be NÜE4
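
    The two U-umlaut rules in A and B could be applied to each reading roughly as
    follows. This is a sketch under the assumption that readings are upper-case
    syllables followed by a single tone digit; fix_umlaut and fix_mandarin_value
    are hypothetical helper names.

```python
import re

def fix_umlaut(reading):
    """Apply the two umlaut rules to one reading such as 'LUE4'."""
    # Rule A: the umlaut is not used in LÜN or LÜAN (write LUN, LUAN).
    reading = re.sub(r"^LÜ(AN|N)(\d)$", r"LU\1\2", reading)
    # Rule B: LUE and NUE always take the umlaut (write LÜE, NÜE).
    reading = re.sub(r"^([LN])UE(\d)$", r"\1ÜE\2", reading)
    return reading

def fix_mandarin_value(value):
    """Apply fix_umlaut to every space-separated reading in a field value."""
    return " ".join(fix_umlaut(r) for r in value.split())
```

    Note that correcting U+6523 this way turns both of its readings into LUAN2,
    so the value would still need its duplicate removed afterwards.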

    C. Other Invalid Pinyin Spellings
    IE, LIONG, YIAN, IONG and YIAO are invalid Mandarin pinyin spellings.
    U+34C8 kMandarin BEI4 BING4 FEI4 IE4 should be BEI4 BING4 FEI4 YE4 ?
    U+3D88 kMandarin LIONG3 YING2 should be YONG3 YING2 ?
    U+66D5 kMandarin YIAN4 should be YAN4
    U+7867 kMandarin IONG3 should be YONG3
    U+9D01 kMandarin YIAO1 should be YAO1
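
    Spellings like these can at least be flagged automatically. The sketch below
    uses a blacklist containing only the five syllables named above; a thorough
    check would compare against a complete table of valid pinyin syllables, which
    is not included here. The names INVALID_SYLLABLES and invalid_readings are
    hypothetical.

```python
import re

# Only the five spellings named in this section; not a complete list.
INVALID_SYLLABLES = {"IE", "LIONG", "YIAN", "IONG", "YIAO"}

def invalid_readings(value):
    """Return the readings in a kMandarin value that look invalid."""
    bad = []
    for reading in value.split():
        # A reading is expected to be letters followed by a tone digit 1-5.
        m = re.fullmatch(r"([A-ZÜ]+)([1-5])", reading)
        if m is None or m.group(1) in INVALID_SYLLABLES:
            bad.append(reading)
    return bad
```

    Flagging rather than rewriting seems the safer choice here, since two of the
    proposed corrections above are themselves marked with a question mark.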

    D. Duplicate Readings
    Whilst there may be a historical reason why some characters in the Unihan
    database originally had multiple duplicate Mandarin readings, there is no reason
    why they should still be there in the latest version.
    U+3561 kMandarin HE2 HE2 HE4 HE4 HUO4
    U+3563 kMandarin YAN3 YAN4 YAN4
    U+356A kMandarin DAN3 DAN3
    U+3613 kMandarin LAN2 LAN2
    U+363A kMandarin FA2 FA2
    U+369C kMandarin XU4 XU4 YU4
    U+38BD kMandarin ER3 ER3
    U+39A9 kMandarin YIN3 YIN3
    U+39E5 kMandarin XIAN3 XIAN3
    U+3C34 kMandarin PO2 POU3 POU3
    U+3C76 kMandarin BENG4 JIAO4 PENG2 PENG2 QIAO3 RU4
    U+3C78 kMandarin BI4 BI4 BIE2
    U+3E2E kMandarin FEN2 FEN2
    U+3F18 kMandarin WA3 WA3
    U+3FC5 kMandarin XIAN3 XUAN3
    U+400E kMandarin MIAN3 MIAN3
    U+4014 kMandarin NIU2 REN4 REN4
    U+414D kMandarin DONG4 DONG4 TING3
    U+4192 kMandarin JIU4 JIU4
    U+42BC kMandarin CHI3 CHI3
    U+431F kMandarin BI4 BI4
    U+443D kMandarin MAN2 MAN2
    U+46AC kMandarin LIN2 LIN2
    U+47A2 kMandarin ZHA4 ZUO2 ZUO2
    U+47F8 kMandarin KUI2 KUI2
    U+4902 kMandarin MEI2 MEI2
    U+4913 kMandarin MENG2 MENG2
    U+49DD kMandarin DI4 YI2 ZHI4 ZHI4
    U+4A8A kMandarin LONG2 LONG2
    U+4B6A kMandarin LIANG2 LIANG2
    U+4C95 kMandarin PU3 PU3 PU4
    U+4DA7 kMandarin YAO3 YAO3
    U+4E0A kMandarin SHANG4 SHANG4
    U+4E21 kMandarin LIANG3 LIANG3 LIANG4
    U+4E24 kMandarin LIANG3 LIANG3 LIANG4
    U+4E4E kMandarin HU1 HU2 HU1
    U+4E86 kMandarin LIAO3 LIAO3
    U+4EC0 kMandarin SHI2 SHEN2 SHI2 SHI2
    U+4EF0 kMandarin YANG3 YANG3 YANG4 ANG2 YANG4 ANG2
    U+4F5D kMandarin KOU4 GOU1 KOU4
    U+4F63 kMandarin YONG4 YONG1 YONG2 YONG2
    U+4F7F kMandarin SHI3 SHI3 SHI4
    U+4F97 kMandarin TONG2 TONG3 DONG4 DONG4 TONG3
    U+4FF3 kMandarin PAI2 PAI2
    U+5012 kMandarin DAO3 DAO4 DAO4
    U+5169 kMandarin LIANG3 LIANG3 LIANG4
    U+5207 kMandarin QIE1 QIE4 QIE4 QI4 QI4 QI4 QI4
    U+5212 kMandarin HUA2 HUA4 HUA2
    U+524A kMandarin XUE1 XIAO1 XUE1
    U+525D kMandarin BO1 BAO1 BO1
    U+52B3 kMandarin LAO2 LAO2 LAO4
    U+52D8 kMandarin KAN1 KAN4 KAN1
    U+52DE kMandarin LAO2 LAO2 LAO4
    U+5395 kMandarin CE4 CE4
    U+53A0 kMandarin CE4 CE4
    U+53C2 kMandarin CAN1 SAN1 SHEN1 DEN1 CEN1 SHEN1 SAN1 SAN3
    U+53C3 kMandarin CAN1 SAN1 SHEN1 DEN1 CEN1 SHEN1 SAN1 SAN1
    U+53CD kMandarin FAN3 FAN3 FAN1
    U+53F6 kMandarin XIE2 YE4 YE4 SHE4
    U+5403 kMandarin CHI1 CHI1 JI1
    U+5408 kMandarin HE2 HE2 GE3
    U+5410 kMandarin TU3 TU4 TU3
    U+5439 kMandarin CHUI1 CHUI1 CHUI4
    U+54D1 kMandarin YA3 YA1 YA3 E4
    U+5527 kMandarin JI1 JI1
    U+555E kMandarin YA3 YA1 YA3 E4
    U+5587 kMandarin LA3 LA3
    U+559E kMandarin JI1 JI1
    U+55AB kMandarin CHI1 CHI1 JI1
    U+561B kMandarin MA5 MA5
    U+5632 kMandarin CHAO2 ZHAO1 ZHAO1
    U+5730 kMandarin DI4 DE5 DI4
    U+575E kMandarin WU4 WU4
    U+5795 kMandarin HOU4 HOU4
    U+57AB kMandarin DIAN4 DIAN4
    U+57E0 kMandarin BU4 BU4
    U+57FD kMandarin SAO4 SAO3 SAO3
    U+5862 kMandarin WU4 WU4
    U+588A kMandarin DIAN4 DIAN4
    U+58A9 kMandarin DUN1 DUN1
    U+58AA kMandarin DUN1 DUN1
    U+58AC kMandarin DI4 DE5 DI4
    U+590D kMandarin FU4 FU4
    U+59E8 kMandarin YI2 YI2
    U+5B32 kMandarin NIAO3 NIAO3
    U+5B81 kMandarin ZHU4 NING2 NING4 NING2
    U+5BD5 kMandarin NING2 NING4 NING2
    U+5BE7 kMandarin ZHU4 NING2 NING4 NING2
    U+5C0A kMandarin ZUN1 ZUN1
    U+5C7F kMandarin YU3 XU4 YU3
    U+5DBC kMandarin YU3 XU4 YU3
    U+5E05 kMandarin SHUAI4 SHUAI4 SHUO4
    U+5E25 kMandarin SHUAI4 SHUAI4 SHUO4
    U+5EC1 kMandarin CE4 CE4
    U+5F04 kMandarin NONG4 LONG4 LONG4
    U+5F0E kMandarin SAN1 SAN1 SAN4
    U+5F1F kMandarin DI4 TI4 TI4
    U+5F87 kMandarin XUN4 XUN2 XUN4
    U+5F95 kMandarin LAI2 LAI2 LAI4
    U+5FA0 kMandarin LAI2 LAI2 LAI4
    U+5FA9 kMandarin FU4 FU4 FOU4
    U+5FF8 kMandarin NIU3 NIU3 NÜ4
    U+608C kMandarin TI4 TI4
    U+60DD kMandarin TANG3 CHANG3 CHANG3
    U+6310 kMandarin RU2 NA2 NU2 RAO2 RU2
    U+6837 kMandarin YANG2 YANG4 XIANG4 YANG2
    U+6BD3 kMandarin YU4 YU4
    U+6C59 kMandarin WU1 WU4 WA1 WU4 YU1
    U+7089 kMandarin LU2 LU2
    U+762A kMandarin PIE1 BIE3 BIE1 BIE3
    U+7A79 kMandarin QIONG1 QIONG1 KONG1 QIONG2
    U+7F5B kMandarin GU1 GU1
    U+7FA1 kMandarin XIAN4 YI2 XIAN4 YAN2 YI2
    U+7FA8 kMandarin XIAN4 YI2 YAN2 YI2
    U+82A6 kMandarin LU2 LU3 LU2
    U+8449 kMandarin XIE2 YE4 YE4 SHE4
    U+8897 kMandarin ZHEN3 ZHEN1 ZHEN3
    U+8910 kMandarin HE2 HE4 HE2
    U+8E42 kMandarin ROU2 ROU2
    U+8EAA kMandarin LIN4 LIN4
    U+8EE4 kMandarin HU1 HU1
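
    Removing such duplicates while keeping the first occurrence of each reading in
    its original position takes one line in Python. A sketch, with dedupe_readings
    as a hypothetical helper name; it assumes readings are separated by whitespace
    and that the order of the surviving readings matters.

```python
def dedupe_readings(value):
    """Drop repeated readings, keeping the first occurrence of each."""
    # dict.fromkeys preserves insertion order, so the first copy of each
    # reading stays where it was and later copies are discarded.
    return " ".join(dict.fromkeys(value.split()))
```

    For example, dedupe_readings("HE2 HE2 HE4 HE4 HUO4") gives "HE2 HE4 HUO4",
    and a value with no duplicates is returned as it was.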

