Sixteenth International Unicode Conference

TUSTEP and Culturally Correct Searching in Multilingual Corpora on the Web

Marc Wilhelm Küster - Zentrum für Datenverarbeitung der Universität Tübingen

Intended Audience:	Software Engineer, Project Manager
Session Level:	Intermediate

Statement of purpose:

- Show the need for a consistent internationalization strategy for Web search engines which work on multilingual corpora;
- Demonstrate sample implementations using the TUSTEP toolbox and outline implementation strategies;
- Show on-the-fly conversion from the local storage format to Unicode for output on the Web.

Abstract

TUSTEP is a toolbox of building blocks for the flexible handling of structured texts, especially XML-conformant texts. It started life in the late 1960s and has been enhanced ever since. For some 25 years it has been used in both commercial and scholarly applications.

Building blocks include amongst others:
- Professional typesetting facilities (user base in the publishing industry; several prizes for excellence in typography) with strong support for multiscript texts;
- ISO/IEC 14651-conformant collation tools;
- Database support for XML databases, with access to these databases via CGI using TUSTEP's efficient and powerful script language.

The paper will concentrate on the implementation of search engines. It will outline general strategies and desiderata for intelligent fuzzy searching on multilingual corpora (cf. also my report for the European Commission on European requirements in the field of browsing and matching at http://www.stri.is/TC304/Matching). These requirements include dealing with local fallback rules (e.g. Þ [Thorn] / Th), transliteration, phonetically aware searching, legacy character sets, ordering expectations, etc.
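To make the idea of fallback-aware matching concrete, here is a minimal sketch in Python (not TUSTEP's script language, and not the implementation presented in the talk). It folds both the query and the stored text to a loose search key, so that a user who types "Thor" still finds "Þórr". The FALLBACKS table and the names search_key and matches are hypothetical and chosen for illustration only; a real system would need rules per language, since expectations differ (a German reader may expect ü to match "ue" rather than "u").

```python
import unicodedata

# Hypothetical fallback table (an assumption for illustration): each entry
# maps a letter to the spelling a user without that key might type instead.
FALLBACKS = {
    "þ": "th",
    "ð": "d",
    "æ": "ae",
    "ø": "o",
}

def search_key(text: str) -> str:
    """Fold text to a loose key: case-fold, apply fallbacks, drop accents."""
    folded = []
    for ch in text.casefold():
        if ch in FALLBACKS:
            folded.append(FALLBACKS[ch])
            continue
        # Decompose the character, then drop combining marks (ó -> o).
        decomposed = unicodedata.normalize("NFD", ch)
        folded.append(
            "".join(c for c in decomposed if not unicodedata.combining(c))
        )
    return "".join(folded)

def matches(query: str, candidate: str) -> bool:
    """True if the folded query occurs inside the folded candidate."""
    return search_key(query) in search_key(candidate)
```

Comparing folded keys rather than raw strings keeps the stored text untouched, so the same data can still be displayed and sorted with its original letters.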

The paper will then proceed to demonstrate exemplary implementations in TUSTEP and explain the decisions behind them. It will be shown how data -- including data in non-Latin scripts -- is converted on-the-fly to UTF-8 for presentation on the web. The same data pool can conveniently be used for high-quality database publishing.
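The conversion step can be sketched as follows, again in Python rather than TUSTEP. The sketch assumes that records are stored in a legacy 8-bit encoding (ISO 8859-7 is used here purely as an example; TUSTEP's actual storage format is not described in this abstract) and are transcoded to UTF-8 only at the moment a CGI response is built. The function names to_utf8 and cgi_response are hypothetical.

```python
def to_utf8(stored: bytes, storage_encoding: str) -> bytes:
    """Transcode a stored record from its legacy encoding to UTF-8."""
    return stored.decode(storage_encoding).encode("utf-8")

def cgi_response(stored: bytes, storage_encoding: str) -> bytes:
    """Build a minimal CGI response whose body is the record in UTF-8."""
    # The charset parameter tells the browser how to decode the body.
    header = b"Content-Type: text/html; charset=utf-8\r\n\r\n"
    return header + to_utf8(stored, storage_encoding)
```

Because the transcoding happens at output time, the stored data stays in its original format and can serve other uses, such as typesetting, unchanged.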

The talk will be divided approximately equally between general strategies and concrete implementations and should therefore be of interest to both programmers and project managers.

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

5 December 1999, Webmaster