From: Edward H Trager (email@example.com)
Date: Thu Feb 13 2003 - 15:26:55 EST
On Thu, 13 Feb 2003, Andrew C. West wrote:
> On Thu, 13 Feb 2003 09:48:45 -0800 (PST), "Zhang Weiwu" wrote:
> > Take it easy, if you find one 500B (the measure word) it is usually enough to
> > say it is traditional Chinese, one 4E2A (measure word) is in simplified
> > Chinese. They never happen together in a logically correct document.
> Marco is absolutely correct that Simplified and Traditional Chinese may
> legitimately be found together on the same Web page (and I for one have several
> pages where they do).
> Just adding my two fens' worth, Traditional/Simplified is an artificial modern
> distinction that has been exacerbated by the GB simplified-only coding standards
> on the one hand and traditional-only coding standards such as Big5 on the other,
> which forced people to use either Simplified or Traditional characters
> exclusively. Most simplified characters have in fact been around for centuries,
> and if you open the pages of any down-market commercial edition of a Chinese
> book printed during the Yuan, Ming or Qing dynasties (last 700 years) you are
> likely to find plenty of "simplified" forms mixed up with "traditional" forms.
And I've seen books printed in the beginning years of the PRC era using
mostly simplified, but with smatterings of traditional characters here and
there. These books were printed in the days of lead type, so I
always assumed that they just ran out of the trays of simplified type, but
didn't want to stop the presses ...
> Certainly, I've seen "traditional" texts which mix U+500B with U+4E2A (and with
> U+7B87 for that matter). With Unicode it is now possible to transcribe
> traditional texts as they are written, rather than translate into "traditional"
> or "simplified". Take, for example, this Web page --
> http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which
> transcribes a short one-act play from the Cantonese Opera tradition, published
> during the Qing dynasty (probably early 19th century). It has U+4E2A (simplified
> ge4) but not U+500B (traditional ge4), and yet is written mostly in
> "traditional" characters. How would your algorithm classify such a page?
> Also, you should remember that a Chinese page written in Classical Chinese --
> and there are plenty of electronic editions of the Classics on the Web -- might
> have no instances of the vernacular character ge4 at all.
Right, there are all kinds of exceptions so a heuristic algorithm based on
most common characters will have limitations. Still, in practice these
things can be quite useful even if they are imperfect for certain
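
For what it's worth, here is a rough Python sketch of the kind of common-character heuristic being discussed. Everything specific in it is an assumption made for illustration: the function name classify_script, the two short marker lists (pairs such as U+500B/U+4E2A), and the 0.9 dominance threshold are hypothetical and were not proposed by anyone in this thread. It reports "mixed" or "unknown" rather than forcing a verdict, since texts like the Qing play above or a Classical Chinese page may not fit either label.

```python
from collections import Counter

# Hypothetical marker sets: a few very common characters whose traditional
# and simplified forms differ. A real list would need to be much longer.
TRADITIONAL_MARKERS = set("個們這國說時會來")
SIMPLIFIED_MARKERS = set("个们这国说时会来")


def classify_script(text, dominance=0.9):
    """Guess whether text is traditional, simplified, mixed, or unknown.

    dominance is the share of marker hits one side must reach before the
    text is labelled as that script; the default is an arbitrary assumption.
    """
    counts = Counter(text)
    trad = sum(counts[c] for c in TRADITIONAL_MARKERS)
    simp = sum(counts[c] for c in SIMPLIFIED_MARKERS)
    total = trad + simp

    if total == 0:
        # No marker characters at all, e.g. a short Classical Chinese text.
        return "unknown"
    if trad / total >= dominance:
        return "traditional"
    if simp / total >= dominance:
        return "simplified"
    # Both kinds of marker occur in significant proportion.
    return "mixed"


if __name__ == "__main__":
    for sample in ("這個國家", "这个国家", "這個人和那个人", "道可道非常道"):
        print(sample, classify_script(sample))
```

With so few markers, a single stray U+4E2A in an otherwise traditional text is enough to tip a short page into "mixed", which is arguably the honest answer for the cases raised above.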