Re: traditional vs simplified chinese

From: Thomas Chan (tc31@cornell.edu)
Date: Fri Feb 14 2003 - 09:46:01 EST

    On Thu, 13 Feb 2003, Zhang Weiwu wrote:
    >Take it easy, if you find one 500B (the measure word) it is usually enough to
    >say it is traditional Chinese, one 4E2A (measure word) is in simplified
    >Chinese. They never happen together in a logically correct document.

    Others have already given examples of logically correct documents with
    both characters, but one cannot always have the luxury of assuming the
    data is not deviant. For example, there are many electronic texts online
    that are a hybrid of simplified and traditional text, because they contain
    erroneous conversions from a simplified source document (typically GB2312)
    to a traditional one (typically Big5).

    I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
    simple heuristic for modern text, since it occupies position #11 in at
    least one frequency list (compared to #15 for the above-cited ge4), and as
    far as I know, U+8FD9 is not one of those ancient characters that have
    been promoted/reused as a simplified form.
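
    A minimal sketch of that heuristic, assuming the only evidence used is
    the presence of the two zhe4 code points. The function name and the
    "mixed"/"unknown" labels are illustrative assumptions, not an
    established algorithm; a hybrid text from a bad GB2312-to-Big5
    conversion would show up as "mixed".

```python
# Hypothetical helper: the name, labels, and decision rule are assumptions
# made for illustration only.
ZHE_SIMPLIFIED = "\u8fd9"   # simplified zhe4 'this'
ZHE_TRADITIONAL = "\u9019"  # traditional zhe4 'this'


def guess_script(text: str) -> str:
    """Classify text by which form of zhe4 'this' it contains."""
    simp = text.count(ZHE_SIMPLIFIED)
    trad = text.count(ZHE_TRADITIONAL)
    if simp and trad:
        # Both forms present: possibly a hybrid or badly converted text.
        return "mixed"
    if simp:
        return "simplified"
    if trad:
        return "traditional"
    # Neither form occurs, so this heuristic has nothing to say.
    return "unknown"
```

    Texts that happen to contain neither character are left unclassified
    rather than guessed at.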

    On Thu, 13 Feb 2003, Andrew C. West wrote:
    >Take, for example, this Web page --
    >http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which
    >transcribes a short one-act play from the Cantonese Opera tradition, published
    >during the Qing dynasty (probably early 19th century). It has U+4E2A
    >(simplified ge4) but not U+500B (traditional ge4), and yet is written
    >mostly in
    >"traditional" characters. How would your algorithm classify such a page ?

    Aren't such texts by default "traditional"? "Simplified" text, besides
    using simplified form characters, usually also entails refraining from
    using variant forms (according to PRC definitions of what is a variant).
    And depending on how far one wants to stretch the definition, PRC-style
    vocabulary, etc., cf., http://www.cjk.org/cjk/reference/chinvar.htm and
    http://www.cjk.org/cjk/c2c/c2cbasis.htm .

    On Thu, 13 Feb 2003, Marco Cimarosti wrote:
    >The easiest way to do it is "folding" both the user's query and the content
    >being sought to the same form (either traditional or simplified, it doesn't
    >matter). It may also help to "fold" also other kinds of variants beside
    >simplified and traditional.

    It would help to at least fold the Unicode z-variants together. For
    example, when the data is in Unicode, authors can choose among
    U+6236, U+6237, and U+6238 for hu4 'door', but these are not meaningful
    distinctions, and they are certainly a lot harder to detect than the
    typical traditional/simplified case.
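
    A minimal sketch of such folding, applied to both the query and the
    content before comparing them. The table covers only the three hu4
    code points mentioned above, and picking U+6236 as the canonical form
    is an arbitrary assumption; the function names are hypothetical.

```python
# Illustrative folding table: only the three hu4 'door' variants named in
# the text. Choosing U+6236 as the target is an arbitrary assumption.
Z_VARIANT_FOLD = str.maketrans({
    "\u6237": "\u6236",
    "\u6238": "\u6236",
})


def fold(text: str) -> str:
    """Map each known variant to its canonical form."""
    return text.translate(Z_VARIANT_FOLD)


def matches(query: str, content: str) -> bool:
    """Substring search that ignores the folded variant distinctions."""
    return fold(query) in fold(content)
```

    A real table would need many more entries, and the same approach
    extends to a simplified/traditional mapping where that mapping is
    one-to-one.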

    On Thu, 13 Feb 2003, Edward H Trager wrote:
    >And I've seen books printed in the beginning years of the PRC era using
    >mostly simplified, but with smatterings of traditional characters here and
    >there. These books were printed in the days of lead type, so I

    Those must be the ones printed before the final 1964 version of the
    simplification (drafts date back to 1956, and some usages go back
    earlier, to pre-1949 Communist-occupied areas), so they do not use all
    the simplified characters that eventually appeared in the 1964 version.

    There are even some cases of semi-simplified forms where one half of a
    character had been simplified according to pre-1964 rules, but the
    simplification rule for the other half had to wait until 1964. But I
    think these might've been missed by Unicode, like some of the
    ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
    temporarily different (from the PRC's) schemes prior to 1976.

    On Fri, 14 Feb 2003, Andrew C. West wrote:
    >Now if Hanyu Da Cidian were to be put onto the internet ...

    How about the one here? http://202.109.114.220/

    Thomas Chan
    tc31@cornell.edu
