Segmenting Chinese in Unicode

Tom Emerson - Basis Technology Corporation

Intended Audience: Manager, Software Engineer
Session Level: Intermediate

The automatic segmentation of Chinese text is an ongoing problem in information retrieval and computational linguistics. Because Chinese is written without spaces between words, applications that operate on words (e.g., search engines) must determine the word boundaries algorithmically.

This presentation describes the Unicode-based approach taken in Basis Technology's segmentation system for simplified and traditional Chinese text, the Chinese Morphological Analyzer. Segmentation is based on a very large dictionary of Chinese words with part-of-speech information, combined with knowledge of Chinese morphology. The talk will cover how unknown words are handled using Chinese word formation rules and grammatical information. Common segmentation problems and their solutions will also be discussed.
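To make the idea of dictionary-based segmentation concrete, here is a minimal sketch of greedy forward maximum matching, a common textbook baseline. It is not the Chinese Morphological Analyzer's algorithm, which the abstract does not specify. The function name `segment`, the `max_len` parameter, and the toy `LEXICON` are assumptions made for illustration only. Characters not covered by the lexicon are emitted as single-character words, a crude stand-in for real unknown-word handling.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching over a set of known words.

    Illustrative baseline only; `dictionary` is any set of words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real system would use a very large dictionary.
LEXICON = {"中文", "分词", "搜索", "引擎", "搜索引擎"}

print(segment("中文搜索引擎", LEXICON))  # ['中文', '搜索引擎']
```

Because the match is greedy, "搜索引擎" is kept whole rather than split into "搜索" and "引擎". Greedy matching can also choose the wrong boundary in ambiguous strings, which is one reason a production system would draw on part-of-speech and morphological information as well.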

The engine uses Unicode throughout, allowing it to seamlessly handle traditional and simplified Chinese text, including ideographs used only in certain Chinese locales such as Hong Kong. Diverse applications of the segmentation engine will also be covered: Chinese-to-Chinese script conversion, keyword extraction for information retrieval, and content filtering.
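One reason segmentation helps script conversion is that some simplified characters correspond to more than one traditional character, and the right choice depends on the word. The sketch below assumes the text has already been split into words and uses two tiny hypothetical tables, `WORD_MAP` and `CHAR_MAP`. It illustrates the general technique only and makes no claim about how the engine described here performs conversion.

```python
# Hypothetical toy tables. The simplified character 发 corresponds to
# 發 (as in 发展, "develop") or 髮 (as in 头发, "hair") in traditional script.
WORD_MAP = {"头发": "頭髮", "发展": "發展"}
CHAR_MAP = {"头": "頭", "发": "發"}


def to_traditional(words, word_map=WORD_MAP, char_map=CHAR_MAP):
    """Convert pre-segmented simplified words to traditional script."""
    out = []
    for word in words:
        if word in word_map:
            # A word-level entry resolves one-to-many character mappings.
            out.append(word_map[word])
        else:
            # Otherwise fall back to per-character mapping, leaving
            # characters without an entry unchanged.
            out.append("".join(char_map.get(ch, ch) for ch in word))
    return "".join(out)


print(to_traditional(["头发"]))  # 頭髮
print(to_traditional(["发展"]))  # 發展
```

A purely character-by-character conversion would map 头发 to 頭發, which is why the word-level lookup is consulted first.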

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

31 October 1999, Webmaster