Target Audience: Software Engineer, Systems Analyst
Level of Session: Intermediate
Our web site recently expanded from one language to 11, including Japanese, Chinese, and Korean. Because Microsoft Index Server supported only 7 languages, we decided to build our own multilingual search engine for the site. Within a few months, the engine was online, working with most browsers, and supporting all of the desired languages. With minor changes to the code, several more languages could be added.
We found a consistent way to retrieve information across all languages using Unicode and N-gram matching techniques. How browsers handle code sets today is largely undocumented. By combining HTML with standard code set detection algorithms, we are able to determine the precise encoding used by the browser to submit the query.
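To illustrate why N-gram matching over Unicode text works uniformly across languages, here is a minimal sketch in Python. It is not the engine described in this session: the function names (`char_ngrams`, `build_index`, `search`), the bigram size, and the sample documents are all assumptions chosen for illustration. The idea is that overlapping character windows need no word segmentation, so Japanese or Korean text is indexed the same way as English.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping n-character windows of text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return [compact] if compact else []
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


def build_index(docs, n=2):
    """Map each character n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text.lower(), n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents that contain every n-gram of the query."""
    grams = char_ngrams(query.lower(), n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical sample documents, already decoded to Unicode strings.
docs = {
    "en": "Unicode conference in Boston",
    "ja": "日本語の全文検索エンジン",
    "ko": "한국어 검색 엔진",
}
index = build_index(docs)
japanese_hits = search(index, "検索")
korean_hits = search(index, "검색")
english_hits = search(index, "conf")
```

Intersecting the n-gram sets yields candidate documents rather than guaranteed phrase matches, so a real engine would likely verify or rank the candidates afterwards.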
This presentation will detail the idiosyncrasies of HTML forms, data manipulation, and code set handling. We will also show how Unicode plays a critical role in multilingual information retrieval.
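As a rough illustration of the code set problem with HTML forms, the sketch below percent-decodes a submitted query to raw bytes and then tries a list of candidate encodings until one decodes cleanly. This is a simple trial-decoding heuristic, not the detection algorithm used by the presenters; the function name `decode_query`, the `page_charset` hint, and the candidate list are assumptions made for this example.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical fallback order; a real detector would also weigh byte statistics,
# because some byte sequences decode without error in more than one code set.
CANDIDATES = ["utf-8", "shift_jis", "euc_kr", "big5"]


def decode_query(raw_value, page_charset=None):
    """Percent-decode a form value and guess its encoding by trial decoding.

    page_charset is an assumed hint: the charset of the page that served the
    form, which browsers commonly reuse when submitting the query.
    Returns (encoding, text), or (None, None) if no candidate fits.
    """
    data = unquote_to_bytes(raw_value)
    order = ([page_charset] if page_charset else []) + CANDIDATES
    for encoding in order:
        try:
            return encoding, data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    return None, None


# Simulate a browser submitting the same Japanese query in two encodings.
utf8_query = quote("検索", encoding="utf-8")
sjis_query = quote("検索", encoding="shift_jis")

utf8_result = decode_query(utf8_query)
sjis_result = decode_query(sjis_query, page_charset="shift_jis")
fallback_result = decode_query(sjis_query)
```

Once the bytes are decoded to Unicode, the rest of the search pipeline no longer needs to know which code set the browser used.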
When the world wants to talk, it speaks Unicode
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
GMS is pleased to be able to offer the International Unicode Conferences under an exclusive
license granted by the Unicode Consortium. All responsibility for conference finances and
operations is borne by GMS. The independent conference board serves solely at the pleasure
of GMS and is composed of volunteers active in Unicode and in international software
development. All inquiries regarding International Unicode Conferences should be addressed
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.
14 Jun 1999, Webmaster