Unihan Mandarin Readings

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Mon Dec 02 2002 - 05:11:49 EST


    Whilst writing a CJK pinyin lookup utility over the weekend I noticed that for
    some CJK ideographs in the Unihan database that have multiple Mandarin readings,
    the secondary reading(s) have been wrongly associated with adjacent or nearby
    ideographs. For example:

    U+543E kMandarin WU2 YA5
    U+5440 kMandarin YA1
    -- YA5 is another reading for U+5440, but is not a reading for U+543E

    U+54F5 kMandarin BA1 HNG5
    U+54FC kMandarin HENG1
    -- HNG5 is another reading for U+54FC, but is not a reading for U+54F5

    U+963E kMandarin A3 A4 A5 E1 E3 LING3
    U+963F kMandarin A1
    -- A3 A4 A5 E1 E3 are all readings for U+963F, but none are readings for U+963E

    U+97A1 kMandarin ENG1 LA5
    U+97A5 kMandarin YI4
    -- ENG1 is a reading for U+97A5, but is not a reading for U+97A1 (in fact U+97A5
    is the only character in Xiandai Hanyu Cidian that has a reading of ENG)

    I know that the Mandarin readings given in the Unihan database are informative
    only and may not correspond to the expectations of any given user (as we have
    seen before on this list). The problem I have noticed, however, is not that the
    readings are wrong or dubious per se, but that correct readings have been
    assigned to the wrong ideographs. It seems to affect only ideographs that have
    multiple Mandarin readings, and only their non-primary readings, even though
    multiple readings are sorted alphabetically and the primary reading is
    therefore not always listed first.

    Having noticed the obvious examples given above, I decided to go through the
    Unihan entries for the CJK Unified Ideographs block to see if there were any
    more examples of misassociated Mandarin readings. I gave up after just the
    first five rows of the block (U+4E00 to U+4E4F), which appear to have at least
    ten misassociated variant or secondary Mandarin readings:

    U+4E0C kMandarin FOU1 FOU3 JI1
    U+4E0D kMandarin BU4
    -- FOU1 and FOU3 are secondary readings for U+4E0D, not U+4E0C

    U+4E15 kMandarin LIANG3 LIANG4 PI1
    U+4E12 kMandarin CHOU3
    -- LIANG3 and LIANG4 are variant readings for U+4E12, not U+4E15

    U+4E22 kMandarin DIU1 LIANG4
    U+4E21 kMandarin LIANG3
    -- LIANG4 is a secondary reading for U+4E21, not U+4E22

    U+4E25 kMandarin BANG4 YAN2
    U+4E26 kMandarin BING4
    -- BANG4 is a secondary reading for U+4E26, not U+4E25

    U+4E2B kMandarin YA1 ZHONG4
    U+4E2D kMandarin ZHONG1
    -- ZHONG4 is a secondary reading for U+4E2D, not U+4E2B

    U+4E33 kMandarin CHAN3 LIN4
    U+4E34 kMandarin LIN2
    -- LIN4 is a secondary reading for U+4E34, not U+4E33

    U+4E3B kMandarin LI2 ZHU3
    U+4E3D kMandarin LI4
    -- LI2 is a secondary reading for U+4E3D, not U+4E3B

    U+4E3E kMandarin JU3 NÜE4 TUO1 ZHE4
    -- I'm not sure where the readings NÜE4, TUO1 and ZHE4 belong, but I'm pretty
    certain it isn't U+4E3E

    U+4E49 kMandarin WU4 YI4
    U+4E4C kMandarin WU1
    -- WU4 is a secondary reading for U+4E4C, not U+4E49

    U+4E4F kMandarin FA2 LUO4 YAO4 YUE4
    U+4E50 kMandarin LE4
    -- LUO4, YAO4 and YUE4 are all secondary readings for U+4E50, not U+4E4F
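
    Several of the cases above share a pattern: the stray reading has the same
    syllable as the only reading of a nearby ideograph and differs just in tone
    (LIANG4 next to LIANG3, ZHONG4 next to ZHONG1, LIN4 next to LIN2). The Python
    sketch below is one rough way to shortlist such candidates for manual review.
    It is not the utility mentioned at the top of this message; the names
    parse_kmandarin, syllable and suspects and the window parameter are
    hypothetical, and it assumes input lines in the "U+XXXX kMandarin ..." form
    quoted here. It will miss cases where the syllables differ (FOU/BU,
    BANG/BING) and may flag readings that are in fact legitimate.

```python
# Sample lines copied from the examples in this message.
SAMPLE = """\
U+4E21 kMandarin LIANG3
U+4E22 kMandarin DIU1 LIANG4
U+4E2B kMandarin YA1 ZHONG4
U+4E2D kMandarin ZHONG1
U+4E33 kMandarin CHAN3 LIN4
U+4E34 kMandarin LIN2
U+4E0C kMandarin FOU1 FOU3 JI1
U+4E0D kMandarin BU4
"""


def parse_kmandarin(text):
    """Return {code point: [readings]} for every kMandarin line in text."""
    readings = {}
    for line in text.splitlines():
        fields = line.split()  # works for space- or tab-separated fields
        if len(fields) >= 3 and fields[0].startswith("U+") and fields[1] == "kMandarin":
            readings[int(fields[0][2:], 16)] = fields[2:]
    return readings


def syllable(reading):
    """Strip the trailing tone digit, e.g. 'LIANG4' -> 'LIANG'."""
    return reading.rstrip("0123456789")


def suspects(readings, window=8):
    """Shortlist readings that may belong to a nearby ideograph.

    A reading of an ideograph with several readings is flagged when some
    ideograph within `window` code points has exactly one reading with the
    same syllable but a different tone. This is a heuristic (an assumption
    of this sketch), not a definitive test.
    """
    found = []
    for cp, own in sorted(readings.items()):
        if len(own) < 2:
            continue
        for reading in own:
            for other in range(cp - window, cp + window + 1):
                theirs = readings.get(other, [])
                if (other != cp and len(theirs) == 1
                        and theirs[0] != reading
                        and syllable(theirs[0]) == syllable(reading)):
                    found.append((cp, reading, other))
    return found


for cp, reading, other in suspects(parse_kmandarin(SAMPLE)):
    print(f"U+{cp:04X} {reading}: compare with U+{other:04X}")
```

    Anything the sketch reports still has to be checked against a dictionary;
    it only narrows down where to look.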

    From the number of misplaced Mandarin readings in this small sample (10 out of
    80 characters, or 12.5%) it would seem to me that the problem is probably
    endemic throughout the CJK Unified Ideographs block. I think that CJK-A is OK,
    but I haven't looked carefully enough to be sure.

    Is it possible to regenerate the Unihan database with the correct secondary
    Mandarin readings?

    Andrew


