RE: traditional vs simplified chinese

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 13 2003 - 13:40:35 EST


    Edward H Trager wrote:
    > [...]
    > If I were going to write such an algorithm, I would:
    >
    > * First, ensure that the incoming text stream to be classified was
    > sufficiently long to be probabilistically classifiable. In other
    > words, what's the shortest stream of Hanzi characters needed, on
    > average, in a typical Chinese text (on the web, for example) in
    > order to encounter at least one "ge" U+500B or U+4E2A? One "wei"
    > U+70BA or U+4E3A? One "shuo" U+8AAC or U+8BF4? It wouldn't take
    > long to figure this out.

    Lucky man! I was discussing a similar subject just yesterday, and
    someone came up with this link:

            http://lingua.mtsu.edu/chinese-computing/statistics/

    The figures in file <total.html> make it easy to answer your question: in a
    typical text, "ge" accounts for 3.54%, "wei" for 1.96%, "shuo" for 2.58%,
    etc.
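
As a rough sketch of how those percentages translate into a minimum stream length: if each figure is read as a per-character frequency p, and characters are treated as independent draws (a simplifying assumption, since real text is not independent), then the chance of seeing the character at least once in n characters is 1 - (1 - p)^n. The function names and the 95% confidence level below are illustrative choices, not something taken from the statistics page.

```python
import math

# Frequencies quoted in this message, as fractions.
# Assumption: each is the share of all characters in a typical text.
FREQ = {"ge": 0.0354, "wei": 0.0196, "shuo": 0.0258}


def expected_gap(p):
    """Mean number of characters read until the first occurrence (1/p)."""
    return 1.0 / p


def min_length(p, confidence=0.95):
    """Smallest n such that 1 - (1 - p)**n >= confidence.

    Assumes independent characters, so this is only a ballpark figure.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for name, p in FREQ.items():
        print(name, round(expected_gap(p), 1), min_length(p))
```

Requiring all three characters to show up, or raising the confidence level, pushes the minimum length up; a real classifier would probably look at many more traditional/simplified pairs than these three.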

    _ Marco


