Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale

Peter Constable & Gary Simons - SIL International

Intended Audience: Manager, Software Engineer
Session Level: Beginner, Intermediate, Advanced

Information technologies, particularly the internet, are rapidly becoming more global in focus. At the same time, and partly as a result, economic development is quickly expanding in many previously less-developed regions of the world. One implication is that IT systems are increasingly confronted with the challenges of the world's ethno-linguistic diversity.

Considerable and productive effort is being made to create adequate I18N infrastructures for issues such as text encoding and processing in IT systems. Yet infrastructures for language and locale identification are lagging behind user needs. The gap between how text is encoded and how it should be processed cannot be properly closed until the language identification problem is solved, since so many aspects of text processing (such as collation and spell-checking) are language specific.

At present we are confronted with an issue of scale. The leading standard for addressing language identification, ISO 639-2, offers codes to identify approximately 450 languages. Yet the number of languages spoken in the world today exceeds 6,000, as documented in SIL's online catalogue of the world's languages (http://www.sil.org/ethnologue). The problem is that the world's linguistic diversity is very complex, yet it is well understood by relatively few.

In this paper, we will explore the world's ethno-linguistic diversity, its challenges for IT, and some directions in which we can move toward solutions. In particular, we will

- give an overview of the world's ethno-linguistic diversity;
- discuss some of the inherent difficulties in devising systems of language and locale identification;
- examine some existing IT practices and their successes and limitations; and
- present work that SIL is doing in relation to language identification that can provide at least part of a needed solution for global IT systems.

