The Unicode standard provides hope that all of the world's languages can be stored and exchanged using a single character set. Storage and exchange of large volumes of data imply the need for information retrieval tools that can deal with many languages at once. Each of the world's writing systems and languages requires some special treatment. Issues such as word segmentation, the matching of inflected forms of words, and the resolution of data differences that are not semantically important to the searcher must be dealt with one language at a time.
This paper explores the issues of segmentation and semantic equivalence that must be addressed if a multilingual stream of Unicode text is to be indexed and retrieved effectively. Examples are drawn from a diverse set of languages to show that algorithms that work well for one language may be entirely inappropriate for another.
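As a small illustration of the segmentation point, the sketch below applies whitespace splitting to an English sentence and to a Chinese one. The function name `whitespace_tokens` and both sample sentences are illustrative assumptions, not examples taken from the paper.

```python
def whitespace_tokens(text):
    """Naive segmentation: split on runs of whitespace."""
    return text.split()

# Illustrative sample data (assumed, not from the paper).
english = "the cat sat on the mat"
chinese = "猫坐在垫子上"  # roughly the same sentence, written without spaces

english_tokens = whitespace_tokens(english)
chinese_tokens = whitespace_tokens(chinese)

print(len(english_tokens))  # one token per English word
print(len(chinese_tokens))  # the whole sentence comes back as a single token
```

Whitespace splitting yields usable index terms for the English sentence but returns the Chinese sentence as one undivided string, so a language such as Chinese would need a dictionary-based or statistical segmenter instead.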
Unicode, with its wide language coverage and its focus on the semantic value of characters, provides a firm foundation for the indexing of the world's data, but Unicode alone is not a solution. Simple pattern matching of data is not sufficient, and different languages often require very different treatment.
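One way to see why simple pattern matching falls short is to compare two encodings of the same word. The sketch below uses Python's standard `unicodedata` module; the helper name `search_key` and the sample strings are assumptions made for illustration, and NFC normalization followed by case folding is only a simplified stand-in for the equivalences a real retrieval system would need.

```python
import unicodedata

def search_key(text):
    """Hypothetical helper: reduce text to a form suitable for comparison.

    Applies NFC normalization, then case folding. This is a simplification;
    it does not cover language-specific equivalences.
    """
    return unicodedata.normalize("NFC", text).casefold()

precomposed = "Caf\u00e9"   # 'é' as one code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by combining acute accent (U+0301)

raw_match = precomposed == decomposed
key_match = search_key(precomposed) == search_key(decomposed)

print(raw_match)  # code-point comparison treats the strings as different
print(key_match)  # comparison on the normalized, case-folded keys
```

The two strings look identical to a searcher yet differ code point by code point. Normalization and case folding resolve this particular difference, but they do nothing for inflected forms or for distinctions that matter in one language and not in another.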
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
GMS is pleased to be able to offer the International Unicode Conferences under an exclusive
license granted by the Unicode Consortium. All responsibility for conference finances and
operations is borne by GMS. The independent conference board serves solely at the pleasure
of GMS and is composed of volunteers active in Unicode and in international software
development. All inquiries regarding International Unicode Conferences should be addressed
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.
26 January 1999, Webmaster