Re: traditional vs simplified chinese

From: Andrew C. West
Date: Fri Feb 14 2003 - 11:42:23 EST

    On Fri, 14 Feb 2003 07:45:44 -0800 (PST), Thomas Chan wrote:

    > I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
    > simple heuristic for modern text, since it occupies position #11 in at
    > least one frequency list (compared to #15 for the above-cited ge4), and as
    > far as I know, U+8FD9 is not one of those ancient characters that have
    > been promoted/reused as a simplified form.

    On the other hand I don't think that zhe4 is used in Cantonese, whereas I think
    that ge4 is, so it wouldn't be so good for pages written in Cantonese (not that
    I have ever seen any, but I'm sure there must be some). Probably even a simple
    heuristic would need to try several common characters such as ge4 and zhe4.
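
A minimal sketch of such a multi-character heuristic, in Python. The function
name and the marker table are illustrative assumptions, not an established
method. The code points for zhe4 (U+8FD9 / U+9019) and traditional ge4
(U+500B) are the ones cited in this thread; U+4E2A for simplified ge4 is
added here. A page containing both forms is reported as "mixed" rather than
being forced into one category.

```python
# Marker pairs: (simplified form, traditional form).
# Assumption: two very common characters are enough for a rough guess.
MARKERS = {
    "ge4": ("\u4e2a", "\u500b"),   # U+4E2A / U+500B
    "zhe4": ("\u8fd9", "\u9019"),  # U+8FD9 / U+9019
}


def guess_script(text):
    """Return 'simplified', 'traditional', 'mixed' or 'unknown'."""
    # Count occurrences of every simplified and every traditional marker.
    simp = sum(text.count(s) for s, _ in MARKERS.values())
    trad = sum(text.count(t) for _, t in MARKERS.values())
    if simp and trad:
        return "mixed"
    if simp:
        return "simplified"
    if trad:
        return "traditional"
    return "unknown"
```

Checking more than one marker keeps the guess usable for text that happens to
avoid one of the characters (Cantonese without zhe4, for instance), but it
remains a rough guess: a text that merely quotes the other form is still
reported as "mixed".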

    > Aren't such texts by default "traditional"? "Simplified" text, besides
    > using simplified form characters, usually also entails refraining from
    > using variant forms (according to PRC definitions of what is a variant).

    Probably true, but the point that I was making is that the simplified ge4 in the
    text would confuse a simple heuristic.

    > There are even some cases of semi-simplified forms where one half of a
    > character might have been simplified according to pre-1964 rules, but the
    > simplification rule for the other half has to wait until 1964. But I
    > think these might've been missed by Unicode, like some of the
    > ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
    > temporarily different (from the PRC's) schemes prior to 1976.

    I think that most of the 1977 simplifications have already been encoded in
    Unicode, but any that haven't and the hybrid semi-simplified forms found in some
    printed books from the 50s and 60s will probably be included in CJK-C along with
    the rest of its unnecessary baggage (excuse my distaste for CJK-C, but I think
    that the Ideographic Rapporteur Group is indiscriminately collecting characters
    that in most cases probably do not need to be encoded, just for the sake of
    encoding as many characters as possible - 24,000+ and counting - see the "CJK
    Extension C Project" at
    for details).

    > >Now if Hanyu Da Cidian were to be put onto the internet ...
    > How about the one here?

    Yes, this is an excellent resource. Although the Hanyu Da Cidian look-up only
    gives definitions, and none of the extremely useful quotations found in the
    printed book, it still mixes traditional form head words with simplified
    definitions, so that both ge4 simplified and traditional are found together on
    the same page if you search under U+500B and look at the appended compound
    words. I guess that according to Thomas's definition of Simplified Chinese, this
    makes it a Traditional Chinese page, even though most of the text is in
    simplified Chinese !?

    Incidentally, for those interested in UTF-16 Chinese web pages, I noticed that
    this site is encoded as UTF-16LE.

    On a related matter, I was wondering about language tagging for Chinese. "zh-CN"
    and "zh-TW" are used quite frequently, but what do they imply ? Is an HTML page
    tagged as "zh-CN" expected to be composed of simplified characters, and a page
    tagged as "zh-TW" expected to be traditional characters ? Or does the CN or TW
    imply nothing about the orthography of the text, in which case the CN or TW may
    simply allow selection of an appropriate font ? What if I am writing a Chinese
    page here in England - should I put "zh-UK" or should I make a political
    decision as to whose side I'm on, and use "zh-CN" or "zh-TW" ?

    On the other hand, "zh-simplified" and "zh-traditional" are sometimes found.
    These tags are less politically charged, but miss out on mixed
    simplified/traditional pages. Is there a "zh-mixed" ?


    This archive was generated by hypermail 2.1.5 : Fri Feb 14 2003 - 12:57:45 EST