Seventeenth International Unicode Conference

The Further Pitfalls and Complexities of Chinese to Chinese Conversion

Thomas R. Emerson - Basis Technology Corporation & Jack Halpern - CJK Dictionary Publishing Society

Intended Audience:	Manager, Software Engineer
Session Level:	Advanced

Unicode is widely understood to provide an effective pivot when transcoding between legacy CJK encodings. However, converting between Chinese encodings and character sets (e.g., from GB2312 to Big Five) requires more than mapping code points: the correspondence between GB2312 and Big Five is one-to-many, so a simple mapping table is not sufficient.
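
The one-to-many problem can be illustrated with a minimal sketch. It assumes the text has already been decoded to Unicode strings, and it uses a tiny hand-picked sample rather than a real conversion table. The names `SC_TO_TC_CHARS`, `SC_TO_TC_WORDS`, `convert_by_char`, and `convert_by_word` are hypothetical and are not taken from the paper. The simplified character 发 corresponds to both 發 ("to emit") and 髮 ("hair"), so a per-character table has to guess, while a word-level lexicon can pick the right form.

```python
# Illustrative data only: a hand-picked sample, not a real conversion table.
# Each simplified character maps to one or more traditional candidates.
SC_TO_TC_CHARS = {
    "头": ["頭"],
    "发": ["發", "髮"],  # "to emit" vs. "hair": one simplified form, two traditional forms
    "展": ["展"],
}

# Hypothetical word-level lexicon that resolves the ambiguity in context.
SC_TO_TC_WORDS = {
    "头发": "頭髮",  # "hair"
    "发展": "發展",  # "development"
}


def convert_by_char(text):
    """Naive mapping: always take the first traditional candidate."""
    return "".join(SC_TO_TC_CHARS.get(ch, [ch])[0] for ch in text)


def convert_by_word(text, max_len=4):
    """Greedy longest match against the word lexicon, with per-character fallback."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in SC_TO_TC_WORDS:
                out.append(SC_TO_TC_WORDS[chunk])
                i += n
                break
        else:
            # No word matched at this position: fall back to the character table.
            out.append(convert_by_char(text[i]))
            i += 1
    return "".join(out)


print(convert_by_char("头发"))  # per-character guess
print(convert_by_word("头发"))  # word-level lookup
```

With this sample data the per-character pass turns 头发 into 頭發, which is wrong, while the word-level pass yields 頭髮. Greedy longest match is only one possible segmentation strategy and is used here for brevity.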

In a paper at IUC 14[1], Jack Halpern and Jouni Kerman presented an in-depth analysis of the difficulties in accurately converting between Simplified Chinese (SC), used in the People's Republic of China and Singapore, and Traditional Chinese (TC), used in Taiwan, Hong Kong, and Macau. They discussed four progressively more accurate levels of conversion and described the lexical data necessary to achieve each conversion level.

This paper presents a new collection of pitfalls and complexities that we have encountered over the last eighteen months, including:

The difference between Traditional Chinese as specified and used in Mainland China and that used in Taiwan and other locales.
The influence of Cantonese on SC<->TC conversion.
The presence of orthographic variants (such as the use of the simplified form of tai2 in Taiwan) and their effect on segmentation, indexing, and SC<->TC conversion.
The complex interactions between legacy encodings and Unicode.
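
As a rough illustration of how orthographic variants affect indexing, the sketch below folds a variant to a single index key, so that a query written with 臺 also finds documents written with 台. The table `VARIANT_FOLD` and the helpers `index_key`, `build_index`, and `search` are hypothetical, and the one-entry table is sample data only. Unconditional folding of this kind is lossy, since a variant form may also be a legitimate character in its own right in some words, so a real system would likely need word-level context.

```python
# Hypothetical variant table: maps a variant form to the form used as index key.
# A single sample entry for illustration; not a complete or authoritative list.
VARIANT_FOLD = str.maketrans({"台": "臺"})


def index_key(term):
    """Fold known variants so that both spellings share one index key."""
    return term.translate(VARIANT_FOLD)


def build_index(docs):
    """Build an inverted index from {doc_id: [terms]} using folded keys."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(index_key(term), set()).add(doc_id)
    return index


def search(index, query):
    """Look up a query term after applying the same folding."""
    return index.get(index_key(query), set())


docs = {1: ["台灣"], 2: ["臺灣"]}
index = build_index(docs)
print(sorted(search(index, "臺灣")))  # both documents share the folded key
```

The same folding is applied at index time and at query time, which is what lets either spelling retrieve both documents.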

We discuss various approaches to addressing these problems and examine in detail the importance of Chinese to Chinese conversion in effective information retrieval. In so doing, we argue that this problem can be viewed as a machine translation task as well as a transcoding task. We also contrast our approach with that presented by Liu et al. at IUC 7[2].

References

[1] Jack Halpern and Jouni Kerman. "The Pitfalls and Complexities of Chinese to Chinese Conversion". Proceedings of the 14th International Unicode Conference, Cambridge, Massachusetts, USA, March 1999.
[2] Shing-Huan Liu, Chi-Ching Hsu, and Cheng-Ping Chang. "An Automatic Translator Between Traditional Chinese and Simplified Chinese in Unicode". Proceedings of the 7th International Unicode Conference, San Jose, California, USA, September 1995.


18 Jun 2000, Webmaster