Re: traditional vs simplified chinese

From: Edward H Trager (ehtrager@umich.edu)
Date: Thu Feb 13 2003 - 11:06:22 EST

    Hi, Paul,

    On Thu, 13 Feb 2003, Zhang Weiwu wrote:

    > ----- Original Message -----
    > From: "Paul Hastings" <paul@tei.or.th>
    > To: "Zhang Weiwu" <weiwuzhang@hotmail.com>
    > Sent: Thursday, February 13, 2003 9:16 PM
    > Subject: Re: traditional vs simplified chinese
    >
    > > >meaning "for" (wei in Mandarin pinyin) is the most significant recognizable
    > > >one.
    >
    > Take it easy, if you find one 500B (the measure word) it is usually
    > enough to say it is traditional Chinese, one 4E2A (measure word) is in
    > simplified Chinese. They never happen together in a logically correct
    > document.

    So I think Zhang Weiwu is suggesting a heuristic algorithm for
    classifying Unicode text that is already known, or assumed, to be
    Chinese as either traditional or simplified.

    If I were going to write such an algorithm, I would:

     * First, ensure that the incoming text stream to be classified is
       long enough to be probabilistically classifiable. In other
       words, what's the shortest stream of Hanzi characters needed, on
       average, in a typical Chinese text (on the web, for example) in order
       to encounter at least one "ge" u+500B or u+4E2A? One "wei" u+70BA or
       u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take long to figure
       this out.

     * Secondly, as I imply above, I would test for occurrences of
       several common characters like "ge" u+500B, "wei" u+70BA, "shuo" u+8AAC,
       along with their simplified counterparts. Again, if I were doing this,
       I would want to know, statistically, which characters that differ
       between the two forms are really the most common. Maybe the top 10
       would be sufficient.
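
    The two steps above can be sketched in Python roughly as follows. This
    is only an illustration: the marker table holds just the three pairs
    named in this thread (code points as quoted here), and the names
    MARKER_PAIRS, classify, and min_markers are hypothetical. A real table
    and threshold would have to come from character frequency statistics.

```python
# Hypothetical marker pairs (traditional, simplified), limited to the three
# characters mentioned in this thread. Code points are used as quoted there.
MARKER_PAIRS = [
    ("\u500B", "\u4E2A"),  # "ge", measure word
    ("\u70BA", "\u4E3A"),  # "wei", "for"
    ("\u8AAC", "\u8BF4"),  # "shuo", "to speak"
]

TRADITIONAL_MARKERS = {trad for trad, _ in MARKER_PAIRS}
SIMPLIFIED_MARKERS = {simp for _, simp in MARKER_PAIRS}


def classify(text, min_markers=1):
    """Guess whether Chinese text is traditional or simplified.

    Returns "traditional", "simplified", "mixed", or "unknown".
    min_markers is an assumed threshold: texts with fewer marker
    characters than this are treated as too short to classify.
    """
    trad = sum(1 for ch in text if ch in TRADITIONAL_MARKERS)
    simp = sum(1 for ch in text if ch in SIMPLIFIED_MARKERS)

    if trad + simp < min_markers:
        return "unknown"      # not enough evidence in this text
    if trad > simp:
        return "traditional"
    if simp > trad:
        return "simplified"
    return "mixed"            # equal counts of both kinds of marker
```

    Raising min_markers is one way to express the "sufficiently long"
    requirement from the first step: a short text that happens to contain
    no marker characters comes back as "unknown" rather than being guessed.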

    In practice, such an algorithm would probably work very well. But, as
    Marco Cimarosti has questioned, why do you need to classify text as being
    simplified or traditional?

    One reason I could think of doing that would be as a convenience for
    visitors to a web site whose source documents were a mix of traditional
    and simplified Chinese. Take, for example, a site that provided links to
    news from Mainland, Taiwan, HK, etc. So, a visitor could choose whether
    he wanted to see the site in traditional or simplified characters. It
    wouldn't matter whether the source documents were in simplified or
    traditional characters. The classification algorithm would classify a
    document on the fly before display.

    Based on the classification, a "conversion" algorithm would swap the set
    of most common characters that are visually different between jianti
    (simplified) and fanti (traditional) zi using a simple lookup table. I
    don't remember how big this set of characters is. It wouldn't have to be
    complete. And I would intentionally avoid the "problematic" characters --
    i.e., those simplified characters that can map back to several different
    traditional characters having different meanings. Converting just the
    most common, non-problematic characters between simplified and traditional
    would already be sufficient for fluent readers to guess, decipher, or
    recall from the depths of their memories those few unconverted characters
    with which they may be unfamiliar.
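
    A minimal sketch of such a lookup-table conversion in Python, assuming
    the text has already been classified. The table below is deliberately
    tiny and holds only the three pairs named in this thread; the names
    TRAD_TO_SIMP, SIMP_TO_TRAD, and convert are hypothetical. Characters
    not in the table, including any "problematic" one-to-many characters
    left out on purpose, pass through unchanged.

```python
# Hypothetical, intentionally incomplete table of traditional -> simplified
# pairs. Only the three characters named in this thread are included.
TRAD_TO_SIMP = {
    "\u500B": "\u4E2A",  # "ge"
    "\u70BA": "\u4E3A",  # "wei"
    "\u8AAC": "\u8BF4",  # "shuo"
}
# The reverse table is only valid because every pair above is one-to-one.
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}

_TO_SIMPLIFIED = str.maketrans(TRAD_TO_SIMP)
_TO_TRADITIONAL = str.maketrans(SIMP_TO_TRAD)


def convert(text, target):
    """Swap the characters listed in the table toward the target form.

    target must be "simplified" or "traditional". Anything not in the
    table is left as it is.
    """
    if target == "simplified":
        return text.translate(_TO_SIMPLIFIED)
    if target == "traditional":
        return text.translate(_TO_TRADITIONAL)
    raise ValueError("target must be 'simplified' or 'traditional'")
```

    Building the reverse table by inverting the forward one only works as
    long as one-to-many characters are kept out of the table, which is the
    reason for avoiding them in the first place.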

    So, basically all you would be doing is providing a convenience for your
    readers, making it easier on their eyes to read your web documents in
    either traditional or simplified according to their preference. I know
    that something like that would help me -- sometimes I forget the
    traditional version of a character, and sometimes I forget the simplified
    version. It would be very cool if I could just press a button on a web
    site to switch the display between the two ;-) .
