Unicode in Natural Language Processing Applications
Thomas Emerson - Basis Technology Corporation
Traditionally, natural language processing (NLP) applications have been written to solve a single problem in a single language. In the last several years, however, it has become more common to see NLP frameworks developed for applications in several languages. Nevertheless, these applications are often limited to languages that share a common script (e.g., Western European languages alone) or a common encoding scheme (e.g., ISO 8859-n).
This talk outlines the benefits of Unicode when writing natural language processing applications that must target multiple languages. This is of particular interest to the members of the European Union: with the potential doubling in size of the EU over the next year, the Union will have over 20 official languages in multiple scripts and encodings.
Unifying on a single character representation, especially one with the extended character semantics defined in Unicode, makes implementing NLP applications significantly easier. This talk will show how Basis Technology was able to apply linguistic technology developed for Chinese and Japanese to several European languages.
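As a rough illustration of the point about a single character representation, the sketch below uses Python's standard codecs and `unicodedata` module. It is not drawn from the talk or from any Basis Technology product; the helper name `letter_runs` and the sample strings are hypothetical. It first shows that one byte value means different letters under different ISO 8859 parts, then shows a single routine, driven only by Unicode character categories, handling Latin, Han, and Greek text alike.

```python
import unicodedata

# One byte, three meanings: the interpretation depends on the ISO 8859 part.
raw = b"\xe4"
latin1 = raw.decode("iso-8859-1")    # Latin small letter a with diaeresis
greek = raw.decode("iso-8859-7")     # Greek small letter delta
cyrillic = raw.decode("iso-8859-5")  # Cyrillic small letter ef

# After decoding, each character carries its own identity and properties.
for ch in (latin1, greek, cyrillic):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def letter_runs(text):
    """Group consecutive letters and combining marks into runs.

    Hypothetical helper for illustration only. It relies solely on the
    Unicode general category, so no per-encoding or per-script tables are
    needed. It is not a word segmenter: Chinese and Japanese text needs
    real linguistic analysis to find word boundaries.
    """
    runs, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


print(letter_runs("Straße, 東京! Ελλάδα"))
```

With legacy encodings, the same logic would need a separate character classification table for each ISO 8859 part and for each East Asian encoding; with Unicode, the category lookup is the same call for every script.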
21 February 2002, Webmaster