Eighteenth International Unicode Conference

Multilingual Collation in Two Middle Eastern Script Families

Elaine Renee Keown - Independent Researcher in Computational Semitics, Philadelphia

Intended Audience:	Managers, Software Engineers, Systems Analysts, and Academics who work with CALL (computer-assisted language learning) or with on-campus software development
Session Level:	Intermediate

Statement of Purpose:

This paper explains the collating (alphabetic sorting) properties of two historically related Middle Eastern script families:

the Perso-Arabic script family
the Hebrew-Aramaic square script, usually called the "Hebrew alphabet" today

Both the Hebrew and Arabic scripts have been used to write many languages over the last 2,900 years. Both scripts descend directly from the newly discovered 3rd millennium B.C. alphabetic inscriptions in Egypt's Western Desert. In this paper we focus on how the Hebrew-Aramaic and Perso-Arabic scripts collate differently from scripts descended via Greek, especially when they are used in a multilingual situation.

Paper Description:

Alphabets descended through Greek, such as Roman and Cyrillic, usually developed capital letters in the course of their history. For computer collation they therefore need a collation table to interweave the lower-case and upper-case letters. Such a collation table slows down database software by about 50%.
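To make the interweaving concrete, here is a minimal sketch in Python. It is not taken from the paper: the one-level table `COLLATION_TABLE` and the helper `table_key` are hypothetical names chosen for illustration, and the table covers only the 26 basic Roman letters. Production collation (for example the Unicode Collation Algorithm) uses multi-level weights rather than a single rank per letter.

```python
import string

words = ["banana", "Apple", "apple", "Banana", "cherry"]

# Plain code point order: every capital sorts before every lower-case letter.
by_code_point = sorted(words)

# Hypothetical one-level collation table that interweaves the two cases:
# a < A < b < B < ... < z < Z.
COLLATION_TABLE = {}
for rank, (lower, upper) in enumerate(
    zip(string.ascii_lowercase, string.ascii_uppercase)
):
    COLLATION_TABLE[lower] = 2 * rank
    COLLATION_TABLE[upper] = 2 * rank + 1


def table_key(word):
    """Map each letter to its table weight (one lookup per character)."""
    return [COLLATION_TABLE[ch] for ch in word]


by_table = sorted(words, key=table_key)

print(by_code_point)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
print(by_table)       # ['apple', 'Apple', 'banana', 'Banana', 'cherry']
```

The extra lookup per character in `table_key` is the kind of overhead a collation table adds; the sketch makes no attempt to measure it.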

Perso-Arabic and Hebrew-Aramaic, however, never developed capitals. For multilingual Hebrew-Aramaic, a collation table is not necessary: the two dozen alphabet variants can be interwoven in one chain that provides a separate subcollation for each language. Perso-Arabic, which has a more complex script history and is used to write over 100 languages, does need a collation table for the end of the alphabet. The script developed in two directions, one based directly on the Arabic script and one on the Persian Arabic script. Languages such as Urdu, Panjabi, Pashto, Sindhi, and Siraiki collate multilingually with the Persian Arabic script.
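The contrast described above can be sketched in Python. This is an illustration under stated assumptions, not the chain or table proposed in the paper: `SHARED_CHAIN`, `TAIL`, and `make_key` are hypothetical names, and the chain is abbreviated to a handful of letters. The sketch relies on two facts: the Hebrew base letters sit at U+05D0 to U+05EA in alphabetical order (with each final form placed just before its regular form), and the Arabic alphabet ends heh, waw, yeh while the Persian alphabet ends waw, heh, yeh.

```python
# Hebrew: code point order already matches alphabetical order for these words,
# so sorted() needs no table.
hebrew_words = [
    "\u05D3\u05DC\u05EA",  # delet
    "\u05D0\u05D1",        # av
    "\u05D2\u05DE\u05DC",  # gamal
    "\u05D1\u05D9\u05EA",  # bayit
]
hebrew_sorted = sorted(hebrew_words)  # av, bayit, gamal, delet

# Perso-Arabic: a hypothetical, heavily abbreviated shared chain with the
# Persian additions (peh, tcheh) interwoven among the Arabic letters.
SHARED_CHAIN = [
    "\u0628",  # beh
    "\u067E",  # peh
    "\u062A",  # teh
    "\u062C",  # jeem
    "\u0686",  # tcheh
    "\u0644",  # lam
    "\u0645",  # meem
    "\u0646",  # noon
]

# The end of the alphabet differs by tradition, so it needs a small table.
TAIL = {
    "arabic": ["\u0647", "\u0648", "\u064A"],   # heh, waw, yeh
    "persian": ["\u0648", "\u0647", "\u06CC"],  # waw, heh, farsi yeh
}


def make_key(language):
    """Build a sort key from the shared chain plus the language's tail."""
    order = SHARED_CHAIN + TAIL[language]
    rank = {letter: i for i, letter in enumerate(order)}
    return lambda word: [rank[ch] for ch in word]


letters = ["\u0648", "\u0647", "\u067E", "\u0628"]  # waw, heh, peh, beh

code_point_sorted = sorted(letters)                      # beh, heh, waw, peh
arabic_sorted = sorted(letters, key=make_key("arabic"))  # beh, peh, heh, waw
persian_sorted = sorted(letters, key=make_key("persian"))  # beh, peh, waw, heh
```

In this sketch, plain code point order pushes peh (U+067E) past the end of the basic Arabic letters, while the shared chain keeps it next to beh; only the last few letters require a per-language choice.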

Conclusions:

Future versions of Unicode will include more variant letters for both Hebrew-Aramaic and Perso-Arabic script families. Languages using Perso-Arabic script are spoken by 400 million people in countries that are more computerized every month. The special multilingual collation found in these two script families should be built into future versions of Unicode-compatible algorithms.

12 December 2000, Webmaster