Beyond Text Representation -- Building on Unicode to Implement a Multi-lingual Text Analysis Framework
Thomas Hampp-Bahnmueller - IBM Germany
Applications that process natural language documents in several languages face a variety of text analysis tasks. All of them have to solve the basic tasks of code set conversion and text representation, and for those tasks Unicode provides a solid foundation. Most applications, however, have to go further. Their tasks range from simple tokenization or dictionary lookup to more complex ones such as part-of-speech disambiguation, summarization, or even parsing.
We present the design and implementation of a flexible, TIPSTER-inspired software library that facilitates these multi-lingual text analysis tasks. It builds on Unicode for its text layer but also provides means for representing linguistic entities beyond the text layer. The library focuses on modularity, code exchange and reuse, and configurability. It reaches those goals by separating the application from the implementation modules that actually perform the analysis tasks. Implementation modules for various text analysis tasks can be combined, and modules for the same task can be exchanged without any change to the application.
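The separation between application and implementation modules can be illustrated with a minimal sketch. It is not the library's actual API, which this abstract does not describe. All names here (`Annotation`, `Document`, `WhitespaceTokenizer`, `AlnumTokenizer`, `CapitalizedTagger`, `analyze`) are hypothetical, and the stand-off annotation model is only assumed from the TIPSTER inspiration: linguistic entities are recorded as typed spans over an unchanged Unicode text.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the text, stored apart from the text itself."""
    kind: str
    start: int  # offset of the first code point
    end: int    # offset one past the last code point
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str  # a Python str is a sequence of Unicode code points
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


def spans(text, belongs):
    """Yield (start, end) for each maximal run of characters accepted by `belongs`."""
    start = None
    for i, ch in enumerate(text):
        if belongs(ch):
            if start is None:
                start = i
        elif start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(text)


class WhitespaceTokenizer:
    """Hypothetical module: a token is any run of non-whitespace characters."""
    def process(self, doc):
        for start, end in spans(doc.text, lambda ch: not ch.isspace()):
            doc.annotations.append(Annotation("token", start, end))


class AlnumTokenizer:
    """Hypothetical module for the same task: runs of letters and digits."""
    def process(self, doc):
        for start, end in spans(doc.text, str.isalnum):
            doc.annotations.append(Annotation("token", start, end))


class CapitalizedTagger:
    """Toy downstream module: marks tokens whose first character is uppercase."""
    def process(self, doc):
        for ann in doc.annotations:
            if ann.kind == "token":
                ann.features["capitalized"] = doc.text[ann.start].isupper()


def analyze(text, modules):
    """The application side: it only knows that each module has process()."""
    doc = Document(text)
    for module in modules:
        module.process(doc)
    return doc
```

In this sketch, replacing `WhitespaceTokenizer()` with `AlnumTokenizer()` in the module list changes the tokenization while `analyze` and `CapitalizedTagger` stay untouched, which is the kind of exchange the paragraph above describes.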
We also discuss whether and how analysis tasks are influenced by building on Unicode as a text representation. Depending on the task at hand, the direct influence of Unicode may range from substantial (e.g. for tokenization) to inconsequential (e.g. for summarization). But regardless of that direct influence, we will show that none of these tasks could be accomplished without a solid text representation to start with.
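Why tokenization is the task most directly affected can be seen in a small sketch. It assumes a deliberately simplified rule, not a complete word-boundary algorithm and not the method of the library discussed here: a token is a run of characters whose Unicode general category is a letter, combining mark, or number. The function names `is_word_char`, `tokenize`, and `ascii_tokenize` are hypothetical.

```python
import re
import unicodedata


def is_word_char(ch):
    # General categories starting with L (letters), M (combining marks)
    # or N (numbers) are treated as part of a token in this sketch.
    return unicodedata.category(ch)[0] in "LMN"


def tokenize(text):
    """Split text into runs of word characters, for any script."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def ascii_tokenize(text):
    """A tokenizer that only knows ASCII letters and digits, for contrast."""
    return re.findall(r"[A-Za-z0-9]+", text)
```

Under these assumptions the property-based tokenizer handles Latin, Cyrillic, or Greek text with the same code, and it keeps combining marks attached to their base letters in decomposed text, whereas the ASCII-only variant cuts a word at the first accented letter.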
11 December 2000, Webmaster