UnicodeIUC14
Abstract

The Unicode Retrieval System Architecture (URSA) is a fully Unicode-based retrieval engine for UNIX systems. Unicode is instrumental in tokenizing multilingual texts for retrieval, and it serves as the common intermediate representation for queries and documents in URSA. In this presentation, I will illustrate the role of Unicode in the text-processing pipeline that parses, tokenizes, and indexes documents. I will show how language-dependent issues (Chinese segmentation, Korean morphology) interact with character-set issues (whitespace determination) in a high-performance indexing and retrieval engine that indexes at more than 400 Mb/hour and produces indexes of around 20% of the size of the original text collection. I will also give two real-time demonstrations of the URSA engine in use. The first shows a visualization system for examining retrieval results that departs significantly from standard summary-based approaches to presenting ranked results. The second shows how an interactive cross-language, or "translingual", retrieval system can take advantage of Unicode support to help non-bilingual personnel navigate, query, and retrieve documents in foreign languages effectively. In conclusion, I will describe several related projects that use URSA libraries to develop multilingual text-processing applications; these projects extend the architecture beyond simple document retrieval and demonstrate the versatility of full Unicode support in a retrieval architecture.
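To make the interaction between whitespace determination and Chinese segmentation concrete, here is a minimal Python sketch. It is not URSA's implementation, which the abstract does not describe; the function names `is_han` and `tokenize` are hypothetical, and splitting Han text into single-character tokens is a simple stand-in for real Chinese segmentation. Korean morphology is not addressed at all.

```python
import unicodedata


def is_han(ch):
    # Crude test (an assumption for this sketch): treat a character as Han
    # if its Unicode name marks it as a CJK unified ideograph.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def tokenize(text):
    """Split text into tokens using Unicode properties, not ASCII rules."""
    # Normalize first so equivalent character sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    tokens = []
    current = []

    def flush():
        # Emit the token collected so far, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if is_han(ch):
            # No whitespace separates Chinese words, so this sketch falls
            # back to one token per ideograph.
            flush()
            tokens.append(ch)
        elif ch.isspace() or unicodedata.category(ch).startswith("P"):
            # str.isspace() is Unicode-aware, so U+3000 IDEOGRAPHIC SPACE
            # ends a token just as U+0020 does; punctuation also ends one.
            flush()
        else:
            current.append(ch)
    flush()
    return tokens


if __name__ == "__main__":
    print(tokenize("Unicode\u3000retrieval \u4e2d\u6587"))
```

The point of the sketch is that the separator test and the script test both come from Unicode character properties, so a single loop can handle text in mixed scripts and encodings once everything has been converted to Unicode.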

International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

24 January 1999, Webmaster