Case Study: Porting an NLP Application to Unicode
Nicolas Auclerc - ATR-SLT
TOPICS OF INTEREST: Language processing issues with Unicode data; could also be: Migrating legacy applications to Unicode
Our natural language processing research group has decided to adopt Unicode for all of its data. Consequently, while rewriting one of our existing applications, a graphical tool for tree-banking with parsing aids, we had to integrate Unicode support. The original application was intended only for English (1-byte) and Japanese (2-byte) text. The new specifications also called for a redesign as a client/server application. We "killed two birds with one stone" by using Java, which eased both the use of Unicode and the implementation of the client/server communication.
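To illustrate why Java eases both concerns at once, here is a minimal sketch of how English and Japanese text can travel between a client and a server through a single code path. It is not the application's actual protocol: the class name `UnicodeWireSketch`, the length-prefixed UTF-8 framing, and the sample strings are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: one framing routine for every language,
// because a Java String is Unicode regardless of the script it holds.
public class UnicodeWireSketch {

    // Encode a message as a 4-byte length followed by its UTF-8 bytes
    // (an assumed framing, chosen only for illustration).
    static byte[] encode(String text) {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    // Decode a frame produced by encode() back into a String.
    static String decode(byte[] frame) {
        ByteBuffer buffer = ByteBuffer.wrap(frame);
        byte[] payload = new byte[buffer.getInt()];
        buffer.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "tree" in English and "nihongo" (the word for the Japanese
        // language) written with Unicode escapes to stay source-encoding safe.
        String[] samples = { "tree", "\u65e5\u672c\u8a9e" };
        for (String sample : samples) {
            byte[] frame = encode(sample);
            System.out.println(frame.length + " bytes, round trip ok: "
                    + sample.equals(decode(frame)));
        }
    }
}
```

The same two methods serve both languages; no branch distinguishes 1-byte from 2-byte text.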
The introduction of Unicode allowed us to simplify the existing C code on the server side: only the 2-byte code had to be adapted to Unicode, and the old 1-byte code could simply be thrown away. The use of Unicode also makes the tools integrated in the server universal. This new feature is a valuable investment for the future, when we intend to extend our research to Korean.
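The universality described here can be sketched as a single routine that handles English, Japanese, and Korean text without any per-language branch. The example is written in Java rather than the server's C, and the class name `ScriptCountSketch` and the choice of counting characters per script are assumptions made purely for illustration; the server's actual tools are not described in this abstract.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: one loop over Unicode code points serves
// every language, so adding Korean requires no new code path.
public class ScriptCountSketch {

    // Count how many code points of the text belong to each Unicode script.
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            counts.merge(Character.UnicodeScript.of(codePoint), 1, Integer::sum);
            i += Character.charCount(codePoint); // step over surrogate pairs
        }
        return counts;
    }

    public static void main(String[] args) {
        // English, Japanese (kanji), and Korean (hangul) samples,
        // written with Unicode escapes to stay source-encoding safe.
        String[] samples = { "tree", "\u65e5\u672c\u8a9e", "\ud55c\uad6d\uc5b4" };
        for (String sample : samples) {
            System.out.println(countScripts(sample));
        }
    }
}
```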
Another benefit of rewriting our application as a client/server application in Java is that all input methods on the client side now support Unicode. As a consequence, the interface has been relieved of the task of managing character input and, in part, layout. Hence, handling input and output for a new language such as Korean does not require recompiling the client code.
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.
22 Jun 2001, Webmaster