Twenty-first International Unicode Conference

Designing a Farsi/English Unicode-based Search Engine

Mohammad Azadnia, Maziar Salehi & Ali Mohammad Zareh Bidoki - Iran Telecommunication Research Center (ITRC)

Intended Audience:	Managers, Software Engineers, Systems Analysts
Session Level:	Intermediate, Advanced

This paper describes the design of a prototype Farsi/English search engine. The engine has to cope with characteristic features of the web: heterogeneity, volatility and a huge volume of unstructured information. These features, together with rapid advances in technology, challenge classical Information Retrieval (IR) techniques.

Although the number of sites with Farsi support is growing, little research has been done on the encoding and indexing of Farsi text. Unicode appears to provide a sufficient foundation for this purpose, especially for indexing web pages; however, pages written in the many common Farsi code pages have to be converted into Unicode in order to cover most of the existing sites.
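The conversion step mentioned above can be sketched as follows. This is a minimal illustration, not the converter used in the prototype: the function name `to_unicode`, the fallback list, and the choice of Windows-1256 as the legacy code page are assumptions made for the example. Farsi-specific code pages that Python's standard codecs do not include would need their own mapping tables.

```python
from typing import Optional, Tuple

# Assumed fallback order (hypothetical): try UTF-8 first because it rejects
# most non-UTF-8 input, then a single-byte Arabic-script code page.
FALLBACK_ENCODINGS = ("utf-8", "cp1256")


def to_unicode(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Decode fetched page bytes; return (text, name of the encoding used)."""
    # A charset declared by the page or the HTTP header is tried first.
    # Single-byte decoders rarely fail, so they cannot be told apart by
    # trial decoding alone.
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: move on to the next candidate.
            continue
    # Last resort: keep the page, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8-replace"


if __name__ == "__main__":
    legacy_bytes = "سلام".encode("cp1256")  # simulated legacy-encoded page
    text, used = to_unicode(legacy_bytes)
    print(used, text)
```

Once every page is held as Unicode text, the indexer can treat pages from different code pages uniformly, which is the motivation given above for converting before indexing.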

We used the Unified Modeling Language (UML) to produce a visual model that is easy to scale, and we used clustering techniques and RAID to ensure scalability and reliability. We also tried to apply the Common Object Request Broker Architecture (CORBA), given the distributed object-oriented design of the system and its agent-oriented direction.

Keywords: Unicode, UML, CORBA, Information Retrieval, Search engine, Farsi language, clustering

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

19 January 2002, Webmaster