TUSTEP and Culturally Correct Searching in Multilingual Corpora on the Web
Marc Wilhelm Küster - Zentrum für Datenverarbeitung der Universität Tübingen
Statement of purpose:
- Show the need for a consistent internationalization strategy for
Web search engines which work on multilingual corpora;
TUSTEP is a toolbox of building blocks that allow flexible handling of structured texts, especially XML-conformant texts. It originated in the late 1960s and has been enhanced continuously ever since. For some 25 years it has been used in both commercial and scholarly applications.
Building blocks include amongst others:
The paper will concentrate on the implementation of search engines. It will outline general strategies and desiderata for intelligent fuzzy searching on multilingual corpora (cf. also my report for the European Commission on European requirements in the field of browsing and matching at http://www.stri.is/TC304/Matching). These requirements include dealing with local fallback rules (e.g. Þ [Thorn] / Th), transliteration, phonetically aware searching, legacy character sets, ordering expectations, etc.
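As a rough illustration of what a local fallback rule means for matching, the following Python sketch folds both query and corpus text to a loose search key before comparing them. It is not TUSTEP code and does not reproduce the implementation discussed in the paper; the names `FALLBACKS`, `search_key` and `fuzzy_find`, and the contents of the per-locale tables, are assumptions chosen for the example.

```python
import unicodedata

# Hypothetical per-locale fallback tables (illustrative, not exhaustive).
# The same letter may fall back differently depending on the user's locale.
FALLBACKS = {
    "is": {"þ": "th", "ð": "d", "æ": "ae"},
    "de": {"ä": "ae", "ö": "oe", "ü": "ue"},
}


def search_key(text: str, locale: str) -> str:
    """Fold text to a loose comparison key for the given locale."""
    table = FALLBACKS.get(locale, {})
    out = []
    # Case-fold first, then compose so that "a" + combining diaeresis
    # is looked up as the single letter "ä".
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in table:
            out.append(table[ch])
            continue
        # No locale rule applies: decompose and drop combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def fuzzy_find(query: str, corpus: list[str], locale: str) -> list[str]:
    """Return the corpus entries whose search key contains the query's key."""
    key = search_key(query, locale)
    return [entry for entry in corpus if key in search_key(entry, locale)]
```

With the Icelandic table, a query typed as "Thor" finds an entry spelled "Þór". The name "Müller" folds to "mueller" under the German table but to "muller" under the Icelandic one, which is why a single global folding rule is not enough.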
The paper will then proceed to demonstrate example implementations in TUSTEP and explain the decisions behind them. It will be shown how data -- including data in non-Latin scripts -- is converted on-the-fly to UTF-8 for presentation on the web. The same data pool can conveniently be used for high-quality database publishing.
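The general idea of on-the-fly conversion can be sketched in a few lines of Python. This is only an assumption about the shape of such a step, not a description of how TUSTEP performs it: the function name `to_utf8_response` is hypothetical, and the stored data is assumed to be in a known legacy character set such as ISO 8859-7.

```python
def to_utf8_response(raw: bytes, source_encoding: str) -> tuple[str, bytes]:
    """Re-encode stored bytes as UTF-8 and return a matching content type.

    `source_encoding` is the legacy character set of the stored data,
    given as a Python codec name (for example "iso8859_7" for Greek).
    """
    text = raw.decode(source_encoding)   # legacy bytes -> Unicode text
    body = text.encode("utf-8")          # Unicode text -> UTF-8 bytes
    return "text/html; charset=utf-8", body
```

The stored data stays in its original encoding, so the same pool remains available to other output paths, and only the copy delivered to the browser is converted.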
The talk will be divided approximately equally between general strategies and concrete implementations and should therefore be of interest to both programmers and project managers.
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
5 December 1999, Webmaster