Seventeenth International Unicode Conference

A New Algorithm for Contextual Analysis of Farsi Characters and Its Implementation in Java

Kourosh Fallah Moshfeghi - Iran Telecommunication Research Center (ITRC)

Intended Audience:	Software Engineer, Systems Analyst
Session Level:	Intermediate, Advanced

In this paper, we introduce a new algorithm for contextual analysis of Farsi characters. With minor changes, this algorithm can also be used for Arabic characters. Unlike other algorithms for this task, it is presented in the form of a state machine. Preserving the state results in less processing for each character.

By "contextual analysis", we mean the determination of a character's proper presentation form according to its context. Here "context" means the state of characters surrounding the desired character. In this paper, after a short introduction to contextual analysis and Unicode, we present our new algorithm.
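To make the definition above concrete, the following is a minimal sketch of contextual analysis in Java. It is not the algorithm presented in the paper; it is a simplified illustration under stated assumptions. The class name ContextualFormSketch, the reduced joining table, and the method names are all hypothetical, and the sketch ignores ligatures (such as lam-alef), diacritics, and joiner control characters. It carries one boolean of state from character to character and looks one character ahead.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a reduced contextual-form chooser, not the paper's algorithm.
class ContextualFormSketch {
    enum Form { ISOLATED, INITIAL, MEDIAL, FINAL }

    // Hypothetical, heavily reduced joining classes for a handful of letters.
    enum Joining { DUAL, RIGHT, NONE }

    static Joining joiningOf(char c) {
        switch (c) {
            case '\u0627': // ALEF
            case '\u062F': // DAL
            case '\u0631': // REH
            case '\u0648': // WAW
                return Joining.RIGHT; // connects only to the preceding letter
            case '\u0628': // BEH
            case '\u067E': // PEH
            case '\u0633': // SEEN
            case '\u0644': // LAM
            case '\u0645': // MEEM
            case '\u0646': // NOON
            case '\u06CC': // FARSI YEH
                return Joining.DUAL;  // connects on both sides
            default:
                return Joining.NONE;  // spaces, digits, Latin letters, ...
        }
    }

    // Returns one presentation form per input character, in logical order.
    static List<Form> analyze(String text) {
        List<Form> forms = new ArrayList<>();
        boolean prevJoinsForward = false; // state preserved between characters
        for (int i = 0; i < text.length(); i++) {
            Joining j = joiningOf(text.charAt(i));
            boolean joinsPrev = prevJoinsForward && j != Joining.NONE;
            boolean nextAccepts = i + 1 < text.length()
                    && joiningOf(text.charAt(i + 1)) != Joining.NONE;
            boolean joinsNext = j == Joining.DUAL && nextAccepts;

            Form form;
            if (joinsPrev && joinsNext) {
                form = Form.MEDIAL;
            } else if (joinsPrev) {
                form = Form.FINAL;
            } else if (joinsNext) {
                form = Form.INITIAL;
            } else {
                form = Form.ISOLATED;
            }
            forms.add(form);
            prevJoinsForward = (j == Joining.DUAL); // update state for the next character
        }
        return forms;
    }

    public static void main(String[] args) {
        // SEEN, LAM, ALEF, MEEM
        System.out.println(analyze("\u0633\u0644\u0627\u0645"));
    }
}
```

In this sketch the "context" of a character reduces to two facts: whether the previous letter can connect forward (kept as state) and whether the next letter can connect backward (one character of lookahead). A production implementation would need the full joining data from the Unicode Character Database and would map each chosen form to the corresponding presentation-form code point or glyph.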

Since, as far as we know, there are no established criteria for evaluating contextual analysis algorithms, we also suggest criteria for evaluating such algorithms and their implementations. We then briefly describe existing algorithms for contextual analysis of Arabic (and Farsi) characters and their implementations, present the new algorithm (using an ASM chart), and examine various aspects of it.

Following the description of the new algorithm, we present its implementation in Java for the Bilingual Iran Telecommunication Research Center editor (BIT) and offer new observations on implementing contextual analysis algorithms in word processing applications.

Finally, we compare the performance of our algorithm with the known algorithms and find that the new algorithm has attractive advantages. These include being in the form of a state machine, needing fewer and smaller tables, and having a shorter sliding window length and a shorter decision length. The meaning of the last two criteria is described in the paper. We end the paper with a brief conclusion.



20 Jun 2000, Webmaster