The Unicode standard provides hope that all of the world's languages can be stored and exchanged using a single character set. Storing and exchanging large volumes of data calls for information retrieval tools that can handle many languages at once. Yet each of the world's writing systems and languages requires some special treatment. Issues such as word segmentation, finding inflected forms of words, and resolving data differences that are not semantically important to the searcher must be dealt with one language at a time.

This paper explores the issues of segmentation and semantic equivalence that must be addressed if a multilingual stream of Unicode text is to be indexed and retrieved effectively. Examples are drawn from a diverse set of languages to show that algorithms that work well for one language may be entirely inappropriate for another.


Unicode, with its wide language coverage and its focus on the semantic value of characters, provides a firm foundation for indexing the world's data, but Unicode alone is not a solution. Simple pattern matching of data is not sufficient, and different languages often require very different treatment.
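
As a minimal illustration of why simple pattern matching is not sufficient, the Python sketch below compares two encodings of the same word that differ only in how the accented letter is represented. The helper name `search_key` is hypothetical and is not taken from this paper; it assumes that canonical normalization followed by case folding is an acceptable comparison key, which covers only one narrow kind of semantic equivalence and does nothing for segmentation or inflection.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: build a comparison key by applying canonical
    composition (NFC) and then Unicode case folding."""
    return unicodedata.normalize("NFC", text).casefold()


# The same word stored two ways.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A raw code point comparison treats the two strings as different.
raw_match = precomposed == decomposed

# Comparing the folded keys treats them as the same word.
folded_match = search_key(precomposed) == search_key(decomposed)

print(raw_match, folded_match)
```

A key of this kind would typically be applied both when text is indexed and when a query is processed, so that both sides are compared in the same form. Which differences count as unimportant to the searcher still depends on the language, so a single folding rule is unlikely to suit every language in a multilingual stream.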

