Re: CJK fonts

From: Thomas Chan
Date: Mon Dec 16 2002 - 17:24:48 EST


    (I've merged Andrew's two messages--12/13 and 12/16--together, below.)

    On Fri, 13 Dec 2002, Andrew C. West wrote:
    > On Fri, 13 Dec 2002 01:33:08 -0800 (PST), Thomas Chan wrote:
    > > I can't imagine where the yi4 reading comes from, although I note
    > I was thinking along the same lines. The Kangxi Zidian gives U+3CBC a reading of
    > YI4 (as does the Unihan database - the CHA4 reading seems to be as a variant
    > form of U+6C4A).

    What edition of the _Kangxi Zidian_ are you using that gives explicit
    Mandarin readings like "yi4", or are you interpreting the fanqie notation
    yourself? I use the 1958 edition, 1997 2nd printing published by
    Zhonghua, ISBN 7-101-00518-7.

    I find self-interpretation of fanqie to be fraught with peril, partly
    because fanqie was never a completely perfect transcription system, not to
    mention that fanqie from old dictionaries does not necessarily tell one
    anything about contemporary pronunciation.

    For example, U+5B7B is a Yue (Cantonese), Hakka, and Min character meaning
    'last (child)' (derived from 'last child of an old man', hence the
    character's appearance as 'child' + 'to use up'), pronounced laai1 or lai1
    in Cantonese.[1] However, the old dictionaries, including Kangxi, give a
    fanqie of U+6CE5 U+53F0 U+5207, which would yield an artificial nai2 in
    Mandarin, which is exactly what the _Hanyu Da Zidian_ says explicitly.
    Either the pronunciation has changed from [n-] to [l-] and reading old
    dictionaries fails to account for modern developments, or whoever chose
    U+6CE5 to indicate the onset was pronouncing U+6CE5 as *l-.
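
    To make the mechanics concrete, here is a minimal sketch of the naive
    fanqie combination described above: the onset of the first speller joined
    to the rime and tone of the second. It assumes modern Mandarin readings in
    numbered pinyin (ni2 for U+6CE5, tai2 for U+53F0), and the function names
    are made up for illustration. It ignores sound change entirely, which is
    exactly why its output is artificial.

```python
# Naive fanqie combination (illustrative only): onset of the first speller
# plus rime and tone of the second. No historical phonology is applied.

# Pinyin initials; the two-letter ones come first so they match before z/c/s.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(reading):
    """Split numbered pinyin such as 'tai2' into (initial, final, tone)."""
    body, tone = reading[:-1], reading[-1]
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # zero-initial syllable


def naive_fanqie(upper, lower):
    """Combine the onset of `upper` with the rime and tone of `lower`."""
    onset, _, _ = split_syllable(upper)
    _, rime, tone = split_syllable(lower)
    return onset + rime + tone


# U+6CE5 (ni2) + U+53F0 (tai2), the fanqie quoted above
print(naive_fanqie("ni2", "tai2"))  # nai2
```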

    [1] While there is a long-standing ongoing sound change in Cantonese from
    [n-] to [l-], this word is probably no longer an instance of it, and
    *naai1/nai1 would now be regarded as hypercorrection.


    > At any rate, what I think is important is that we do not assume that YI4
    > is wrong and throw it out just because none of us recognise the reading ...
    > though I guess if it is that obscure, it really hasn't got a place in the Unihan
    > database.

    But what if the character is obscure, and the reading thus also obscure?
    I think there are diminishing returns in over-proofing the unihan
    database for such characters--if they are so rare, then no one will find
    the character by searching on an obscure/artificial reading, and those
    interested should be consulting actual comprehensive dictionaries (like
    the Kangxi or _Hanyu Da Zidian_) instead of relying on a text file. In a
    way, we currently have this situation--the Plane 2 characters are, on
    average, more obscure than the BMP characters, and the lack of information
    is kind of saying "look it up
    yourself if you really, really need to know".
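
    For anyone who does rely on the text file, a rough sketch of a kMandarin
    lookup follows. It assumes the tab-separated layout (code point, field
    name, value) with '#' comment lines; the sample lines only restate
    readings quoted in this thread and are not a verified excerpt of any
    release, and the helper names are hypothetical.

```python
import io

# Illustrative sample in the assumed layout; values restate readings
# mentioned in this thread, not checked against a real copy of the file.
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+3CBC\tkMandarin\tYI4 CHA4\n"
    "U+5514\tkMandarin\tWU2\n"
    "U+8985\tkMandarin\tBIAO4\n"
)


def load_field(lines, field):
    """Collect {code point: [values]} for one field from Unihan-style lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything that is not a three-column record
        codepoint, name, value = parts
        if name == field:
            table[codepoint] = value.split()
    return table


def lookup(table, codepoint):
    """Return the list of readings, or None when nothing is recorded."""
    return table.get(codepoint)


readings = load_field(io.StringIO(SAMPLE), "kMandarin")
print(lookup(readings, "U+3CBC"))   # ['YI4', 'CHA4']
print(lookup(readings, "U+24B62"))  # None
```

    A missing entry comes back as None, which is the file's way of saying
    "look it up yourself".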

    > If Hanyu Da Zidian and Hanyu Da Cidian both give GAN4 for the modern
    > reading of U+5481 I for one would prefer that reading to GEM4. Ci Hai
    > also has such non-Mandarin syllables as NGU2 for U+5514. The principles
    > of Pinyin are clearly defined (and like most PRC dictionaries Ci Hai
    > includes a copy of the Hanyu Pinyin Fang'an as an appendix - even if it
    > does not fully adhere to it), and syllables like GEM4 and NGU2 are
    > simply not allowed.

    I agree with your sentiment that "gem4" is an aberration, despite my
    support of the _Cihai_ (PRC 1979) as evidence that the reading did not
    enter the unihan database out of nowhere. When U+5481 was reinvented by the
    Cantonese, it was patterned both graphically and phonologically on U+7518,
    which is gan1 'sweet' in Mandarin (gam1 in Cantonese). U+5481 is in
    Cantonese gam3 'so (quantity)' (3 = yinqu tone); hence "gan4" is an
    appropriate Mandarin reflex.

    "ngu2" for U+5514 is also an aberration--yet another case of a quixotic
    attempt to mimic dialect pronunciation in Mandarin. Sure, it's m4 (a
    syllabic nasal [m]) 'not' in Cantonese, but this is just a re-use of a
    pre-existing semi-homophonous character, ng4 (another syllabic nasal;
    considered close enough to m4 in Cantonese), a sound in singing. As that
    is wu2 in Mandarin, 'not' should likewise be given an artificial *wu2
    reading (which is what the unihan database has currently--no doubt that
    piece of data was entered from a more sensible dictionary).

    But elsewhere, this battle is lost--U+5187 'to not have' (among other
    meanings), perhaps the most recognizable Cantonese character to
    non-Cantonese, is nowadays given the pronunciation mao3[2], even though
    earlier dictionary compilers such as Samuel Wells Williams, in his 1877
    dictionary, recognized it as derived from U+7121 with a tone change and
    assigned it a Mandarin wu3 reading accordingly.

    [2] I note that even "mao" is a poor approximation; "*mou" would've been
    closer (and still a valid and normal Mandarin syllable).

    > On the other hand, a reading of FIAO4 for the dialectal ideograph U+8985
    > may sound odd to a Mandarin speaker, but it is perfectly acceptable
    > according to the rules of Pinyin ("F" is a valid initial, and "IAO4" is
    > a valid final). FIAO4 is the only reading for this ideograph given in
    > Hanyu Da Cidian, Ci Hai and Xiandai Hanyu Cidian, but interestingly,
    > Unihan gives it a reading of BIAO4 - not sure where that reading comes
    > from.

    Thank you for pointing out this Wu character to me.[3] The artificial
    Mandarin reading of this character is a difficult case. Both the _Cihai_
    and the _Hanyu Da Zidian_ seem to say that this character, which is a
    contraction of 'do not want', is not a typical Wu syllable, though
    apparently pronounceable (syllables existing on the borderline also exist
    in Cantonese phonology, typically in loanwords or onomatopoeia), and
    therefore U+8985 had to be created as a "ligature" of sorts by squishing
    the constituents U+52FF U+8981 into the space normally occupied by one
    character. Therefore, I don't think it odd that there is a
    semi-questionable Mandarin fiao4 reading. The _Hanyu Da Zidian_ does not
    try to give a Mandarin reading in this case, so we still don't know where
    "biao4" came from in the unihan database (or "po4", for that
    matter--unless that is a case of shuffled data that started this whole
    thread). I do note that besides U+8985, a similar-looking interchangeable
    character (but with the halves swapped) is right next to it in _Hanyu Da
    Zidian_. On the Wu pronunciation, I can't comment on it myself, except
    that I see in the _Hanyu Fangyan Cihui_, 2nd ed. that in Suzhou, they say
    [fiae] ("ae" = <ae> ligature) and in Wenzhou, they say [fai]; however, for
    the latter city, it is said to be a contraction of U+5426 U+8981 instead.
    So in a way, the [f-] of the Mandarin reading is justifiable. (I don't know
    enough to comment on the rest of the syllable or tone choice.)
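
    As a rough sketch of the composition argument quoted above (any listed
    initial plus any listed final is spellable, whether or not Mandarin
    actually uses the combination), the check below assumes hand-entered
    tables of initials, finals, and zero-initial spellings, and a made-up
    function name. It says nothing about whether a syllable really occurs.

```python
# Purely compositional check: is the syllable spellable as a listed initial
# plus a listed final? Tables are hand-entered for this sketch.

INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

# Finals as spelled after an initial ("ue" covers the j/q/x spelling of üe).
FINALS = {
    "a", "o", "e", "i", "u", "ü",
    "ai", "ei", "ao", "ou", "an", "en", "ang", "eng", "ong",
    "ia", "ie", "iao", "iu", "ian", "in", "iang", "ing", "iong",
    "ua", "uo", "uai", "ui", "uan", "un", "uang",
    "üe", "üan", "ün", "ue",
}

# Spellings used when there is no initial (y-/w- forms included).
ZERO_INITIAL = {
    "a", "o", "e", "ai", "ei", "ao", "ou", "an", "en", "ang", "eng", "er",
    "yi", "ya", "ye", "yao", "you", "yan", "yin", "yang", "ying", "yong",
    "wu", "wa", "wo", "wai", "wei", "wan", "wen", "wang", "weng",
    "yu", "yue", "yuan", "yun",
}


def is_composable(reading):
    """True if the toneless syllable is a listed initial + listed final."""
    body = reading.lower().rstrip("12345")  # drop the tone digit
    if body in ZERO_INITIAL:
        return True
    for initial in INITIALS:
        if body.startswith(initial) and body[len(initial):] in FINALS:
            return True
    return False


for reading in ("gan4", "gem4", "ngu2", "wu2", "fiao4", "biao4"):
    print(reading, is_composable(reading))
```

    Under these tables gem4 fails because "em" is not a final and ngu2 fails
    because "ng" is not an initial, while fiao4 passes on composition alone.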

    unihan.txt says that U+8985 is in Morohashi--perhaps that might be where
    "biao4" came from?--I don't have access to a Morohashi to check.

    [3] A nice example of the sporadic and often accidental coverage of
    non-Mandarin and non-Yue (Cantonese) characters in Unicode. Wu's U+8985
    is in the BMP, yet a contemporary Mandarin character such as cei3
    'ugly'/cei4 'to hit' winds up in Plane 2 as U+24B62.

    >Sorry if this is getting somewhat OT.

    The same here. I'm fine with taking this privately, but I thought there
    might be some interest in sharing it here, as there are people who are
    using kMandarin quite literally as an "informative" field as their
    primary/sole data...

    Thomas Chan
