Issues and Solution in Pan-China Information Retrieval
Thomas Emerson - Basis Technology Corporation
The last ten years have seen significant effort put into the research and development of information retrieval (IR) systems for Chinese-speaking locales. Internet search engines, digital libraries, and full-text retrieval systems require effective and accurate indexing and query-processing technology, and the features of the Chinese language limit the applicability of many techniques and algorithms used with Western languages. A common limitation of all existing Chinese IR systems is their restriction to texts in a single locale.
This paper describes the special issues in Chinese information retrieval, including the trade-offs between n-gram and word-based indexing models and the effect these decisions have on the algorithms selected and on the way results are presented to the user. It also describes the issues in implementing an IR system that works across Chinese locales, taking into account differences in the character sets and terminology used in different regions of China. To our knowledge, this is the first time such a system has been developed.
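The indexing trade-off mentioned above can be illustrated with a minimal sketch. It is not the system described in this paper: the lexicon is a made-up four-word set, and greedy forward maximum matching is used here only as a simple stand-in for word-based segmentation.

```python
# Toy comparison of two ways to produce index terms for Chinese text.
# LEXICON is a hypothetical four-entry word list, not a real dictionary.

def bigrams(text):
    """Overlapping character bigrams: needs no lexicon, but many of the
    resulting terms straddle word boundaries."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry, falling back to a single character."""
    terms, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                terms.append(candidate)
                i += size
                break
    return terms

LEXICON = {"中文", "信息", "检索", "系统"}
TEXT = "中文信息检索系统"  # "Chinese information retrieval system"

print(bigrams(TEXT))             # 7 terms, including cross-word ones such as "文信"
print(max_match(TEXT, LEXICON))  # 4 terms, one per lexicon word
```

The bigram index is larger and contains terms that are not words, while the word-based index depends entirely on the quality of the lexicon and the segmentation algorithm.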
We show that extending a search engine across Chinese locales limits the effective choices in indexing, in character representation, and in whether to perform query-term expansion: accurate word-based indexing is essential, characters must be represented in Unicode, and term expansion is vital when searching documents authored in a locale different from that of the searcher.
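As a rough sketch of what query-term expansion across locales might look like, the example below maps a query term to every regional variant of the same concept. The table and the function name are hypothetical. The three pairs are commonly cited differences between mainland and Taiwan usage, and a real system would need a curated cross-locale lexicon.

```python
# Hypothetical variant table: each set groups regional terms for one concept.
VARIANT_GROUPS = [
    {"计算机", "電腦"},  # "computer"
    {"软件", "軟體"},    # "software"
    {"信息", "資訊"},    # "information"
]

def expand_query(terms):
    """Replace each query term with the set of its known regional variants.
    Terms without an entry in the table are kept as a one-element set."""
    expanded = []
    for term in terms:
        group = next((g for g in VARIANT_GROUPS if term in g), {term})
        expanded.append(set(group))
    return expanded

# Each inner set would be OR-ed together when the query is run.
print(expand_query(["信息", "检索"]))
```

Because the table holds whole words rather than single characters, the lookup only works when the query and the index share the same word boundaries, which is one reason word-based indexing matters here.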
At the end of this presentation, you will have a better understanding of how the character encodings used across Chinese locales relate and how to deal with them when authoring Chinese-language applications.
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
10 May 2001, Webmaster