Fifteenth International Unicode Conference

Target Audience: Software Engineer, Systems Analyst

Level of Session: Intermediate

Our web site recently expanded from one language to 11, including Japanese, Chinese and Korean. Because Microsoft Index Server supported only seven languages, we decided to build our own multilingual search engine for the site. Within a few months the engine was online, working with most browsers and supporting all of the desired languages. With minor changes to the code, several more languages could be added.

We found a consistent way to retrieve information across all languages using Unicode and N-gram matching techniques. How browsers handle code sets today is largely undocumented. By combining HTML techniques with standard code set detection algorithms, we are able to determine precisely which encoding the browser used to submit the query.
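As a rough illustration of code set detection on a submitted query, the sketch below tries a list of candidate encodings in strict mode and keeps the first one that decodes cleanly. It is not the detection algorithm used by the engine described here. The function name `guess_query_encoding`, the `declared` hint (for example, a charset the page itself was served in) and the candidate list are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the order is an assumption and matters,
# because several East Asian encodings accept each other's byte sequences.
CANDIDATES = ("ascii", "utf-8", "shift_jis", "euc_jp", "euc_kr", "big5", "gb2312")


def guess_query_encoding(
    raw: bytes,
    declared: Optional[str] = None,
    candidates: Sequence[str] = CANDIDATES,
) -> Tuple[str, str]:
    """Return (encoding name, decoded text) for the raw query bytes.

    The declared charset, if any, is tried first; the remaining
    candidates are then tried in order with strict decoding.
    """
    order = ([declared] if declared else []) + [c for c in candidates if c != declared]
    for name in order:
        try:
            return name, raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown charset name.
            continue
    # Latin-1 maps every byte to a code point, so this fallback never fails.
    return "latin-1", raw.decode("latin-1")
```

Strict decoding alone cannot always separate encodings that share byte ranges (EUC-JP and EUC-KR, for instance), which is why a hint taken from the HTML page or form is tried first in this sketch.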

This presentation will detail the idiosyncrasies of HTML forms, data manipulation and code set handling. We will also show how Unicode plays a critical role in multilingual information retrieval.
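To make the role of Unicode in N-gram matching concrete, here is a minimal sketch, assuming character bigrams over normalized Unicode text. Once the query and the documents are both in Unicode, the same code handles Japanese, Chinese, Korean and Western text without word segmentation. The helper names `char_ngrams` and `ngram_score`, the choice of NFKC normalization and the scoring formula are assumptions for illustration, not the engine's actual design.

```python
import unicodedata
from typing import Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of the normalized text."""
    # NFKC folds full-width/half-width variants; casefold ignores case.
    normalized = unicodedata.normalize("NFKC", text).casefold()
    # Drop all whitespace, so n-grams may span word boundaries.
    compact = "".join(normalized.split())
    if len(compact) < n:
        # Text shorter than n is kept as a single gram (or none if empty).
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def ngram_score(query: str, document: str, n: int = 2) -> float:
    """Fraction of the query's n-grams that also occur in the document."""
    query_grams = char_ngrams(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & char_ngrams(document, n)) / len(query_grams)
```

For example, `ngram_score("東京", "東京都庁")` compares the single query bigram against the three bigrams of the document, with no language-specific tokenizer involved.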

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

14 Jun 1999, Webmaster