Re: GBK Traditional to Simplified mapping table

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Thu Jan 10 2002 - 22:39:48 EST


On Thu, 10 Jan 2002, Ken Krugler wrote:

> I've got GBK-encoded text that contains a number of Traditional Hanzi
> characters. I'd like to convert all of these to their Simplified
> equivalents. So does anybody know of a GBK table that maps each
> Traditional form to its Simplified form?

If converting to "simplified equivalents" means reducing the text so that
it can be represented in GB2312, then I'd recommend:
  1) If the GBK character is in GB2312, keep it as-is.
  2) Otherwise, convert to Big5 using Unicode as an intermediary. Take
     the characters that converted to Big5 successfully and use one of
     those many Big5->GB2312 converters as suggested by Frank Tang,
     which will perform the traditional->simplified conversion.
  3) If any characters weren't handled by step #2 (e.g.,
     traditional Chinese characters not in Big5[1]; traditional Chinese
     characters in Big5 but not treated by most Big5->GB2312 converters[2];
     non-Chinese characters used in Japanese[3]/Korean, since the source
     text *is* GBK), then turning them and the surrounding context
     over to a human with access to a number of good dictionaries is
     probably the best way to (hopefully) find a "best fit" within
     the circumstances (e.g., if it happens to be a variant of a
     character that is in GB2312[4]). If even that fails, perhaps the
     character in question can be described graphically a la "A+B"[5] or
     the text in question rewritten[6].
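
The three steps above can be sketched roughly as follows. This is only an
illustration, not a finished converter: it assumes Python's built-in
"gbk", "gb2312" and "big5" codecs are close enough to the tables you care
about, and the function name reduce_to_gb2312 and the trad_to_simp table
are hypothetical stand-ins for whichever Big5->GB2312 converter you pick.

```python
def encodable(ch, codec):
    """Return True if the single character ch can be encoded in codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def reduce_to_gb2312(gbk_bytes, trad_to_simp):
    """Sketch of steps 1-3.

    gbk_bytes    -- the GBK-encoded source text
    trad_to_simp -- hypothetical dict standing in for a real
                    Big5->GB2312 (traditional->simplified) converter
    Returns (text, leftovers); leftovers is a list of (index, char)
    pairs that still need a human to look at them.
    """
    text = gbk_bytes.decode("gbk")  # Unicode as the intermediary
    out = []
    leftovers = []
    for i, ch in enumerate(text):
        # Step 1: already in GB2312, keep as-is.
        if encodable(ch, "gb2312"):
            out.append(ch)
            continue
        # Step 2: only characters that exist in Big5 go to the converter.
        simp = trad_to_simp.get(ch) if encodable(ch, "big5") else None
        if simp is not None and encodable(simp, "gb2312"):
            out.append(simp)
        else:
            # Step 3: leave it in place and flag it for manual review.
            out.append(ch)
            leftovers.append((i, ch))
    return "".join(out), leftovers


if __name__ == "__main__":
    table = {"\u570b": "\u56fd"}  # toy one-entry table, for illustration
    print(reduce_to_gb2312("\u4e2d\u570b".encode("gbk"), table))
```

Leftover characters are kept in the output together with their positions,
so that whoever reviews them can see the surrounding context.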

[1] e.g., U+5700 (GBK 0x87F3) is a variant form of guo2 'country' that
    is not in Big5, but one can substitute U+56FD (GB2312 0xB9FA),
    the form of guo2 'country' used in simplified Chinese.
[2] e.g., U+5187 (GBK 0x83D3) is in Big5, used primarily to write mou
    'not' in Cantonese (but other meanings also exist), but I haven't seen
    a converter to GB2312 yet that'll substitute U+65E0 (GB2312 0xCEDE),
    a near-synonym and etymologically-related character.
[3] e.g., U+7A93 (GBK 0xB799) is a Japanese form of chuang1 'window', but
    one can substitute U+7A97 (GB2312 0xB4B0).
[4] See [1], [2], [3].
[5] i.e., as the combination of its components.
[6] e.g., U+72C6 (GBK 0xA0F0) occurs in Big5, and in most Chinese texts
    where it is encountered it means 'Japanese spaniel dog; Japanese Chin'
    (and not a pejorative ethnonym), so it will have to be rewritten to
    whatever name that dog breed goes under in GB2312 simplified Chinese
    texts.
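
Once a human has settled on substitutions like those in [1]-[3], they can
be kept in a small hand-maintained table and applied after the automatic
pass. The sketch below uses only the three pairs from the footnotes; the
names REVIEWED and apply_reviewed_substitutions are made up for the
example, and whether a given substitution (especially the near-synonym
in [2]) is acceptable still depends on the text at hand.

```python
# Hand-reviewed substitutions taken from footnotes [1], [2] and [3].
REVIEWED = {
    "\u5700": "\u56fd",  # [1] variant of guo2 'country' -> simplified form
    "\u5187": "\u65e0",  # [2] Cantonese mou 'not' -> near-synonym
    "\u7a93": "\u7a97",  # [3] Japanese form of chuang1 'window'
}


def apply_reviewed_substitutions(text, table):
    """Replace each character that has a reviewed substitute; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    fixed = apply_reviewed_substitutions("\u5700\u7a93", REVIEWED)
    # Raises UnicodeEncodeError if anything is still outside GB2312.
    print(fixed.encode("gb2312").hex())
```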

Thomas Chan
tc31@cornell.edu


