The Further Pitfalls and Complexities of Chinese to Chinese Conversion

Thomas R. Emerson - Basis Technology Corporation & Jack Halpern - CJK Dictionary Publishing Society

Intended Audience: Manager, Software Engineer
Session Level: Advanced

It is generally understood that Unicode provides an effective pivot when transcoding between legacy CJK encodings. However, converting between Chinese character sets and encodings (e.g., from GB2312 to Big Five) requires more than mapping code points: the correspondence between GB2312 and Big Five is one-to-many, so a simple mapping table is not sufficient.
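
As a rough illustration of why a one-to-one table falls short, the sketch below maps a few Simplified characters to their possible Traditional counterparts. The table holds a handful of commonly cited examples only, and the names SC_TO_TC_CANDIDATES, naive_convert, and ambiguous_positions are hypothetical, introduced here for illustration rather than taken from any existing converter.

```python
# Illustrative candidate table: each Simplified character maps to one or more
# Traditional characters. These few entries are well-known examples, not a
# complete or authoritative data set.
SC_TO_TC_CANDIDATES = {
    "发": ["發", "髮"],        # "to emit" vs. "hair"
    "干": ["幹", "乾", "干"],  # "to do" vs. "dry" vs. "shield; to concern"
    "后": ["後", "后"],        # "after" vs. "empress"
    "头": ["頭"],              # unambiguous
}


def naive_convert(text):
    """Code-point style conversion: always take the first candidate."""
    return "".join(SC_TO_TC_CANDIDATES.get(ch, [ch])[0] for ch in text)


def ambiguous_positions(text):
    """List (index, character, candidates) wherever more than one
    Traditional form is possible, i.e. where context is needed."""
    return [
        (i, ch, SC_TO_TC_CANDIDATES[ch])
        for i, ch in enumerate(text)
        if len(SC_TO_TC_CANDIDATES.get(ch, [])) > 1
    ]


if __name__ == "__main__":
    word = "头发"  # "hair (on the head)"; the expected Traditional form is 頭髮
    print(naive_convert(word))
    print(ambiguous_positions(word))
```

With this table the naive conversion of 头发 yields 頭發 rather than 頭髮, because the choice between 發 and 髮 depends on the word in which the character appears, which a per-character table cannot see.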

In a paper presented at IUC 14[1], Jack Halpern and Jouni Kerman gave an in-depth analysis of the difficulties in accurately converting between Simplified Chinese (SC), used in the People's Republic of China and Singapore, and Traditional Chinese (TC), used in Taiwan, Hong Kong, and Macau. They discussed four progressively more accurate levels of conversion and described the lexical data necessary to achieve each level.

This paper presents a new collection of pitfalls and complexities that we have encountered over the last eighteen months, including:

  • The difference between Traditional Chinese as specified and used in Mainland China and that used in Taiwan and other locales.
  • The influence of Cantonese on SC<->TC conversion.
  • The presence of orthographic variants (such as the use of the simplified form of tai2 in Taiwan) and their effect on segmentation, indexing, and SC<->TC conversion.
  • The complex interactions between legacy encodings and Unicode.
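
To make the orthographic-variant item above concrete, here is a minimal sketch of folding variants to a single form before indexing, so that 臺灣 and 台灣 produce the same index key. The one-entry table VARIANT_FOLD and the function index_key are hypothetical, chosen only for illustration; a real system would need a much larger, carefully curated table.

```python
# Hypothetical variant-folding table: map each variant to one chosen
# representative form. Only the tai2 example from the text is included.
VARIANT_FOLD = {
    "臺": "台",
}


def index_key(term):
    """Return the term with known orthographic variants folded to a single
    representative, for use as an index or lookup key."""
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in term)


if __name__ == "__main__":
    # Both spellings of "Taiwan" fold to the same key.
    print(index_key("臺灣") == index_key("台灣"))
```

Folding at index time leaves the original text untouched, so the document can still be displayed in the spelling its author used.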

We discuss various approaches to addressing these problems and examine in detail the importance of Chinese-to-Chinese conversion in effective information retrieval. In so doing, we argue that this problem can be viewed as a machine translation task as well as a transcoding task. We also contrast our approach with that presented by Liu et al. at IUC 7.[2]


