From: Thomas Chan (tc31@cornell.edu)
Date: Fri Feb 14 2003 - 09:46:01 EST
On Thu, 13 Feb 2003, Zhang Weiwu wrote:
>Take it easy, if you find one 500B (the measure word) it is usually enough to
>say it is traditional Chinese, one 4E2A (measure word) is in simplified
>Chinese. They never happen together in a logically correct document.
Others have already given examples of logically correct documents with
both characters, but even setting those aside, one does not always have
the luxury of assuming the data is not deviant. For example, there are
many electronic texts online
that are a hybrid of simplified and traditional text, because they contain
erroneous conversions from a simplified source document (typically GB2312)
to a traditional one (typically Big5).
I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
simple heuristic for modern text, since it occupies position #11 in at
least one frequency list (compared to #15 for the above-cited ge4), and as
far as I know, U+8FD9 is not one of those ancient characters that have
been promoted/reused as a simplified form.
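
A rough sketch of what such a heuristic might look like, in Python. The
function name and the four labels are made up for illustration; only the
two code points come from the paragraph above. It also reports "mixed"
rather than forcing a choice, since hybrid texts do turn up in practice.

```python
# Hypothetical helper, not anyone's actual implementation.
# The marker pair is zhe4 'this': simplified U+8FD9, traditional U+9019.
SIMP_ZHE = "\u8fd9"
TRAD_ZHE = "\u9019"


def guess_script(text):
    """Classify text by which form of zhe4 occurs in it."""
    simp = text.count(SIMP_ZHE)
    trad = text.count(TRAD_ZHE)
    if simp and trad:
        return "mixed"        # e.g. a botched GB2312-to-Big5 conversion
    if simp:
        return "simplified"
    if trad:
        return "traditional"
    return "unknown"          # too short, or not modern running text


if __name__ == "__main__":
    print(guess_script("\u8fd9\u662f"))
```

A single marker character is of course fragile on short strings; checking
a handful of high-frequency pairs would presumably be more robust, at the
cost of having to vet each pair for reused ancient forms.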
On Thu, 13 Feb 2003, Andrew C. West wrote:
>Take, for example, this Web page --
>http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which
>transcribes a short one-act play from the Cantonese Opera tradition, published
>during the Qing dynasty (probably early 19th century). It has U+4E2A
>(simplified ge4) but not U+500B (traditional ge4), and yet is written mostly
>in "traditional" characters. How would your algorithm classify such a page?
Aren't such texts by default "traditional"? "Simplified" text, besides
using simplified form characters, usually also entails refraining from
using variant forms (according to PRC definitions of what is a variant).
Depending on how far one wants to stretch the definition, it may also
entail PRC-style vocabulary, etc.; cf. http://www.cjk.org/cjk/reference/chinvar.htm
and http://www.cjk.org/cjk/c2c/c2cbasis.htm .
On Thu, 13 Feb 2003, Marco Cimarosti wrote:
>The easiest way to do it is "folding" both the user's query and the content
>being sought to the same form (either traditional or simplified, it doesn't
>matter). It may also help to "fold" also other kinds of variants beside
>simplified and traditional.
It would help to at least fold the Unicode z-variants together. For
example, now that Unicode data is possible, authors can choose among
U+6236, U+6237, and U+6238 for hu4 'door', but these are not meaningful
distinctions, and mixing them is certainly a lot harder to detect than
the typical traditional/simplified case.
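
As a minimal sketch of that kind of folding, in Python: the table below
holds only the three hu4 code points mentioned above, and the choice of
U+6236 as the target is arbitrary. A real table would have to be built
from variant data (e.g. the Unihan kZVariant field), which is not
reproduced here. The names fold and matches are hypothetical.

```python
# Illustrative only: folds the three hu4 'door' code points to one form.
# Picking U+6236 as the canonical target is an arbitrary assumption.
Z_FOLD = str.maketrans({
    "\u6237": "\u6236",
    "\u6238": "\u6236",
})


def fold(text):
    """Map known z-variants onto a single representative code point."""
    return text.translate(Z_FOLD)


def matches(query, content):
    """Substring search after folding both sides to the same form."""
    return fold(query) in fold(content)


if __name__ == "__main__":
    # A query typed with U+6238 finds content written with U+6237.
    print(matches("\u6238", "x\u6237y"))
```

Since both the query and the content go through the same table, it does
not matter which member of each variant group is chosen as the target.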
On Thu, 13 Feb 2003, Edward H Trager wrote:
>And I've seen books printed in the beginning years of the PRC era using
>mostly simplified, but with smatterings of traditional characters here and
>there. These books were printed in the days of lead type, so I
Those must be the ones printed before the final 1964 version of the
simplification scheme (drafts date back to 1956, and there were some
earlier pre-1949 usages in Communist-occupied areas), so they do not use
all of the simplified characters that eventually appeared in the 1964
version.
There are even some cases of semi-simplified forms where one half of a
character had been simplified according to pre-1964 rules, but the
simplification rule for the other half had to wait until 1964. But I
think these might have been missed by Unicode, like some of the
ultra-simplified forms in the short-lived 1977 scheme, and those in
Singapore's schemes prior to 1976, which temporarily differed from the
PRC's.
On Fri, 14 Feb 2003, Andrew C. West wrote:
>Now if Hanyu Da Cidian were to be put onto the internet ...
How about the one here? http://202.109.114.220/
Thomas Chan
tc31@cornell.edu