Re: CJK fonts

From: Andrew C. West
Date: Tue Dec 17 2002 - 08:03:09 EST


    On Tue, 17 Dec 2002 02:25:13 -0800 (PST), Thomas Chan wrote:

    > What edition of the _Kangxi Zidian_ are you using that gives explicit
    > Mandarin readings like "yi4", or are you interpreting the fanqie notation
    > yourself? I use the 1958 edition, 1997 2nd printing published by
    > Zhonghua, ISBN 7-101-00518-7.

    I've got two Zhonghua Shuju editions, one published in Hong Kong, and one
    published in Beijing - the pagination is different but they are both facsimile
    reprints of the same original edition. I'm interpreting the fanqie notation. In
    the case of YI4 (U+3CBC), Kangxi quotes Guang Yun as having a fanqie notation of
    U+9B5A [YU2] / U+80BA [FEI4], whilst it quotes Ji Yun as having a fanqie
    notation of U+9B5A [YU2] / U+5208 [YI4], with the additional note, pronounced
    the same as U+4E42 [YI4], which is fairly unambiguous.
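
    The mechanics of reading a fanqie can be sketched roughly as follows: take the
    initial of the upper character and the final and tone of the lower one. This
    is only a naive illustration on modern tone-numbered pinyin, assumed here for
    convenience; the rhyme books describe older sound categories, so a mechanical
    splice is not a reliable guide. The names split_syllable and naive_fanqie are
    made up for the example.

```python
import re

# Pinyin initials, two-letter ones first so "zh" is tried before "z".
# Treating y/w as initials is a spelling convenience, not real phonology.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split tone-numbered pinyin such as 'tai2' into (initial, final, tone)."""
    match = re.fullmatch(r"([a-z]+)([1-5])", syllable.lower())
    if match is None:
        raise ValueError(f"not tone-numbered pinyin: {syllable!r}")
    letters, tone = match.groups()
    for initial in INITIALS:
        if letters.startswith(initial) and len(letters) > len(initial):
            return initial, letters[len(initial):], tone
    return "", letters, tone  # zero initial, e.g. 'an1'


def naive_fanqie(upper, lower):
    """Initial of the upper speller plus final and tone of the lower speller."""
    initial, _, _ = split_syllable(upper)
    _, final, tone = split_syllable(lower)
    return initial + final + tone


print(naive_fanqie("yu2", "yi4"))   # Ji Yun pair quoted above
print(naive_fanqie("yu2", "fei4"))  # Guang Yun pair quoted above
```

    The Ji Yun pair comes out as yi4, whereas the Guang Yun pair mechanically
    gives "yei4", which is not a Mandarin syllable and so needs interpretation.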

    > I find self-interpretation of fanqie to be fraught with peril, partially
    > as fanqie was never a completely perfect transcription system, not to
    > mention that fanqie from old dictionaries does not necessarily tell one
    > anything about contemporary pronunciation.

    Agreed, I would not use fanqie readings as evidence for contemporary
    pronunciations, but the fanqie readings for obscure and obsolete ideographs
    given in Ji Yun, Guang Yun, Yu Pian etc. and quoted in dictionaries like the
    Kangxi Zidian are our main evidence for their pronunciation. Where do modern
    dictionaries like Hanyu Da Zidian, Hanyu Da Cidian, Ci Yuan and Ci Hai get their
    pinyin readings of obscure and obsolete ideographs? Presumably from the fanqie
    readings (that may date back to the Tang dynasty) in pre-modern dictionaries.
    For example, what about the reading of HAN2 for U+5481 when meaning "milk" that
    is given in Hanyu Da Cidian and Hanyu Da Zidian? The only reference Hanyu Da
    Cidian gives for this reading is to Yu Pian, and all that my edition of "Songben
    Yu Pian" says of the character is "U+4E73, X,Y qie" (can't remember the actual
    fanqie notation given, but I'm sure it correlates to a reading of something like
    HAM2 which would be Mandarinised to HAN2). Given that probably nobody's used
    U+5481 to mean "milk" for a thousand years, Yu Pian's fanqie reading is all we
    have to go on.

    > e.g., U+5B7B, is a Yue (Cantonese), Hakka, and Min character, meaning
    > 'last (child)' (derived from 'last child of an old man', hence the
    > character's appearance as 'child' + 'to use up'), pronounced laai1 or lai1
    > in Cantonese.[1] However, the old dictionaries including Kangxi give a
    > fanqie of U+6CE5 U+53F0 U+5207, which would yield an artificial nai2 in
    > Mandarin, which is exactly what the _Hanyu Da Zidian_ says explicitly.
    > Either the pronunciation has changed from [n-] to [l-] and reading old
    > dictionaries fails to account for modern developments, or whoever chose
    > U+6CE5 to indicate the onset was pronouncing U+6CE5 as *l-.
    > [1] While there is a long-standing ongoing sound change in Cantonese from
    > [n-] to [l-], this is probably no longer one of them, and *naai1/nai1
    > would now be regarded as hypercorrection.

    I suspect that this is a whole new can of worms, and I don't feel qualified to
    make any comment without the safety net of Wang Li or Karlgren ... I'll think
    about this at home, and get back to you off-list if I have anything sensible to
    say.

    > But what if the character is obscure, and the reading thusly also obscure?
    > I think there are diminishing benefits to overly-proofing the
    > unihan database for such characters--if they are so rare, then no one will
    > find the character by searching on an obscure/artificial reading, and if
    > it is so rare, then those interested should be consulting actual
    > comprehensive dictionaries (like the Kangxi or _Hanyu Da Zidian_) instead
    > of relying on a text file. In a way, we currently have this
    > situation--the Plane 2 characters are, on average, more obscure than the
    > BMP characters, and the lack of information is kind of saying "look it up
    > yourself if you really, really need to know".

    Agreed. Reiterating my comment below, maybe the Unihan Mandarin readings should
    be completely rewritten based on Hanyu Da Zidian.

    > I agree with your sentiment that "gem4" is an aberration, despite my
    > support of the _Cihai_ (PRC 1979) in that it did not get included in the
    > unihan database from out of nowhere.

    Yes, you're probably right that the Unihan reading of GEM4 is not a mistake as
    such, but a reading derived from Ci Hai - the readings for the basic CJK range
    probably pre-date the publications of the more reliable Hanyu Da Cidian and
    Hanyu Da Zidian. If anyone has nothing better to do with their time they might
    consider completely rewriting the Unihan Mandarin readings using the Hanyu Da
    Zidian as the primary (or even sole) source.

    > When U+5481 was reinvented by the
    > Cantonese, it was patterned both graphically and phonologically on U+7518,
    > which is gan1 'sweet' in Mandarin (gam1 in Cantonese). U+5481 is in
    > Cantonese gam3 'so (quantity)' (3 = yinqu tone); hence "gan4" is an
    > appropriate Mandarin reflex.
    > "ngu2" for U+5514 is also an aberration--yet another case of a quixotic
    > attempt to mimic dialect pronunciation in Mandarin. Sure, it's m4 (a
    > syllabic nasal [m]) 'not' in Cantonese, but this is just a re-use of a
    > pre-existing semi-homophonous character, ng4 (another syllabic nasal;
    > considered close enough to m4 in Cantonese), a sound in singing. As that
    > is wu2 in Mandarin, so thus should 'not' be given an artificial *wu2
    > reading (which is what the unihan database has currently--no doubt that
    > piece of data was inputted from a more sensible dictionary).
    > But elsewhere, this battle is lost--U+5187 'to not have' (among other
    > meanings), perhaps the most recognizable Cantonese character to
    > non-Cantonese, is nowadays given the pronunciation mao3[2], despite
    > the recognition of earlier dictionary compilers such as Samuel Wells
    > Williams in his 1877 dictionary who recognized it as derived from U+7121
    > with a tone change, and assigned it a Mandarin wu3 reading accordingly.
    > [2] I note that even "mao" is a poor approximation; "*mou" would've been
    > closer (and still a valid and normal Mandarin syllable).

    MOU3 would be a better approximation than MAO3, but I guess that MAO3 is the
    chosen reading on the pattern of Mandarin words pronounced MAO that are
    pronounced MOU in Cantonese (Chairman Mou for example? -- correct me if I'm
    wrong, I'm only a novice at Cantonese).

    > Thank you for pointing out this Wu character to me.[3] The artificial
    > Mandarin reading of this character is a difficult case. Both the _Cihai_
    > and the _Hanyu Da Zidian_ seem to say that this character, which is a
    > contraction of 'do not want', is not a typical Wu syllable, though
    > apparently pronounceable (syllables existing on the borderline also exist
    > in Cantonese phonology, typically in loanwords or onomatopoeia), and
    > therefore U+8985 had to be created as a "ligature" of sorts by squishing
    > the constituents U+52FF U+8981 into the space normally occupied by one
    > character.

    There are a number of portmanteau characters like FIAO4 - the Mandarin BENG2
    U+752D (BU4 on top of YONG4) corresponds closely to FIAO4.

    > Therefore, I don't think it odd that there is a
    > semi-questionable Mandarin fiao4 reading. The _Hanyu Da Zidian_ does not
    > try to give a Mandarin reading in this case, so we still don't know where
    > "biao4" came from in the unihan database (or "po4", for that
    > matter--unless that is a case of shuffled data that started this whole
    > thread).

    I'm afraid so. Unihan 3.0 only gives BIAO4 -- PO4 is misplaced from U+8987.

    > I do note that besides U+8985, a similar-looking interchangeable
    > character (but with the halves swapped) is right next to it in _Hanyu Da
    > Zidian_. On the Wu pronunciation, I can't comment on it myself, except
    > that I see in the _Hanyu Fangyan Cihui_, 2nd ed. that in Suzhou, they say
    > [fiae] ("ae" = <ae> ligature) and in Wenzhou, they say [fai]; however, for
    > the latter city, it is said to be a contraction of U+5426 U+8981 instead.
    > So in a way, the [f-] of the Mandarin reading is justifiable (I don't know
    > enough to comment on the rest of the syllable or tone choice).
    > unihan.txt says that U+8985 is in Morohashi--perhaps that might be where
    > "biao4" came from?--I don't have access to a Morohashi to check.

    Nor me.

    > [3] A nice example of the sporadic and often accidental coverage of
    > non-Mandarin and non-Yue (Cantonese) characters in Unicode. Wu's U+8985
    > is in the BMP, yet a contemporary Mandarin character such as cei3
    > 'ugly'/cei4 'to hit' winds up in Plane 2 as U+24B62.

    I guess the reason is that the former (U+8985) occurs in a literary context (I
    think that the example Hanyu Da Cidian gives is from some early 20th century
    novel about Shanghai), whereas the latter (U+24B62) is primarily spoken. I know
    the word CEI4 with the meaning "to be broken" (as of crockery), which is a
    meaning that is obviously related to "to hit" and "ugly", but I don't know how I
    would write it. I seem to think that I've seen it written, but I can't remember
    whether it looked like U+24B62 or not. As a Mandarin syllable CEI is not
    mentioned in the Unihan database, which is a shame. Nor, by the way, is TEI,
    which is the more attractive pronunciation of U+5FD2.
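
    A check like this is easy to script. The sketch below assumes the Unihan data
    is tab-separated lines of code point, field name and value, with
    tone-numbered readings separated by spaces. The sample lines are made up from
    readings mentioned in this thread (they are not copied from a real Unihan
    file), and index_syllables is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative sample only, built from readings mentioned in this thread.
SAMPLE = """\
# code point<TAB>field<TAB>value
U+8985\tkMandarin\tBIAO4
U+5514\tkMandarin\tWU2
U+5514\tkCantonese\tM4
"""


def index_syllables(lines, field="kMandarin"):
    """Map each toneless syllable found in `field` to the code points using it."""
    index = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue  # skip other fields and malformed lines
        code_point, _, value = parts
        for reading in value.split():
            syllable = reading.rstrip("0123456789").upper()  # drop the tone digit
            index[syllable].add(code_point)
    return dict(index)


index = index_syllables(SAMPLE.splitlines())
for syllable in ("BIAO", "CEI", "TEI"):
    print(syllable, sorted(index.get(syllable, ())))
```

    Against a real data file one would pass the open file object in place of
    SAMPLE.splitlines(); an empty list for a syllable means no character carries
    that reading in the chosen field.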

    > The same here. I'm fine with taking this privately, but I thought there
    > might be some interest in sharing it here, as there are people who are
    > using kMandarin quite literally as an "informative" field as their
    > primary/sole data...

    Well we're no more OT than half of the postings on this list, and (from my point
    of view) a lot more interesting than most.



    This archive was generated by hypermail 2.1.5 : Tue Dec 17 2002 - 08:41:43 EST