Twenty-first International Unicode Conference

Language Identification in a Unicode Environment

Thomas Hampp-Bahnmueller - IBM Deutschland GmbH

Intended Audience:	Software Engineers, Systems Analysts
Session Level:	Intermediate

The task of Language Identification - to determine the language in which a given text is written - is becoming increasingly important. Most advanced text processing is language specific and needs to know the language of the text. Unfortunately, most documents are not explicitly marked up with the language in which they are written.

Over the past few years, quite a few techniques have been developed to solve this problem. Most of them work quite well, and some have been put to use successfully in software products.

However, most of these algorithms were designed for European languages that can be encoded in a single-byte character set such as Latin-1. Designing a language identification algorithm to work in a 100% Unicode environment brings both new opportunities and new problems to the task.

This talk will describe the task of Language Identification and the most common solutions to the problem. We will try to give an outline of the state of the art in terms of the languages covered and the identification quality achieved.

We will then discuss the additional challenges of performing Language Identification in a Unicode environment - most notably the huge number of potential languages - and the new opportunities Unicode offers - most notably the script information available in Unicode.
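
As a rough illustration of how script information can narrow the set of candidate languages, consider the following Python sketch. It is not the system presented in the talk. It approximates the script of a character by the first word of its Unicode character name, which is an assumption made for brevity and not the Unicode Script property, and the SCRIPT_CANDIDATES table is a hypothetical, deliberately incomplete mapping.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately incomplete mapping from a script label to
# candidate language codes. Illustrative only.
SCRIPT_CANDIDATES = {
    "GREEK": ["el"],
    "HANGUL": ["ko"],
    "THAI": ["th"],
    "HEBREW": ["he", "yi"],
    "CYRILLIC": ["ru", "uk", "bg", "sr"],
    "LATIN": ["en", "de", "fr", "es"],
}


def script_of(ch):
    """Approximate the script of a letter by the first word of its
    Unicode character name (an assumption, not the Script property)."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation, and whitespace
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text):
    """Return the most frequent script label among the letters of text."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


def candidate_languages(text):
    """Narrow the candidate languages using the dominant script alone."""
    return SCRIPT_CANDIDATES.get(dominant_script(text), [])
```

For a script used by a single language in this table, such as Greek, the script alone settles the answer. For widely shared scripts such as Latin or Cyrillic, it only reduces the candidate set, and a further statistical step would still be needed to pick one language.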

We will describe a concrete example of a system that incorporates Unicode-specific processing to solve the language identification task, and outline its coverage and quality.

A final note will touch on the relationship between Code Page Identification and Language Identification, and on Language Identification for multilingual documents.

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

21 February 2002, Webmaster