Unicode for Multilingual Software: An Indian Perspective

Ajit Joshi - C-DAC

Intended Audience:	Software Engineers, Font Designers
Session Level:	Intermediate

Statement of Purpose

The Unicode standard can be used with proper abstraction and refinement to truly internationalize software interfaces for text display, editing and printing.

Paper description

Current software, shaping, and font architectures for handling multilingual text have evolved from an ASCII- and Latin-centric viewpoint. As a result, functionality that should be standardized and implemented at one layer is often found at another. For multilingual text, especially text involving Indic scripts, we find that a simpler and more generic architecture results if there are standards for 1) text display, 2) character cluster determination, and 3) subcluster formation, repositioning of subclusters, and glyph generation from subclusters.

The basic call to display text should take a Unicode string as its parameter. It is the responsibility of the underlying system to choose the proper machinery to display the string. There should not be separate function calls for displaying English and other languages.
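
The idea of a single display call can be sketched as follows. This is only an illustration, not the IndiX implementation: the function names `draw_text`, `split_runs` and `script_of`, and the small table of Unicode block ranges, are assumptions made for the example. The caller passes one Unicode string and never names a script; the underlying layer decides which machinery each run needs.

```python
from typing import List, Tuple

# Hypothetical script table covering only a few Unicode blocks.
SCRIPT_RANGES = [
    (0x0000, 0x007F, "Latin"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x0B80, 0x0BFF, "Tamil"),
]


def script_of(ch: str) -> str:
    """Return the script name for one character, based on its block."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Unknown"


def split_runs(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters of the same script into runs."""
    runs: List[Tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


def draw_text(text: str) -> List[Tuple[str, str]]:
    """Single entry point: the same call serves English and Indic text."""
    payload = text.encode("utf-8")        # what a client would send
    received = payload.decode("utf-8")    # what the display layer sees
    return split_runs(received)           # each run goes to its own shaper


# "Hello " followed by the Devanagari word namaste, in one call.
print(draw_text("Hello \u0928\u092e\u0938\u094d\u0924\u0947"))
```

The application code is the same whatever the language of the string; only the layer below `draw_text` knows that two scripts were involved.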

In IndiX, the X server has been modified so that clients can send UTF-8 encoded text in several Indian scripts.

The basic editing interfaces should understand character clusters and the extent of each cluster. The call for getting the extent should take the character cluster as its parameter. Toolkits can build complex layout operations on top of this call.
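
A minimal sketch of how a toolkit might build on such an extent call is given below. The names `get_extent` and `caret_stops` and the fixed advance of 10 units are hypothetical; in a real system the extent would come from the text-to-glyph machinery and the font, not from a per-character sum.

```python
import unicodedata
from typing import List

ADVANCE = 10  # hypothetical fixed advance for every spacing character


def get_extent(cluster: str) -> int:
    """Toy extent of one cluster: nonspacing marks (Mn) add no width."""
    return sum(ADVANCE for ch in cluster if unicodedata.category(ch) != "Mn")


def caret_stops(clusters: List[str]) -> List[int]:
    """Caret positions for a line, one stop per cluster boundary."""
    stops = [0]
    for cluster in clusters:
        stops.append(stops[-1] + get_extent(cluster))
    return stops


# Clusters of the Devanagari word namaste, already determined elsewhere:
# NA, MA, and the conjunct SA + VIRAMA + TA + VOWEL SIGN E.
line = ["\u0928", "\u092e", "\u0938\u094d\u0924\u0947"]
print(caret_stops(line))
```

Because the caret only ever moves by whole clusters, an editor built this way cannot place the cursor in the middle of a conjunct.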

The Unicode standard designates some characters as combining characters, and this information can be used for cluster identification.
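
As a rough illustration of cluster identification from Unicode character properties, the sketch below uses Python's standard `unicodedata` module. It is a deliberate simplification and not the algorithm used in IndiX or the one in Unicode's text segmentation rules: a character joins the current cluster if it is a mark, or if it is a letter that follows a virama (canonical combining class 9). Joiner characters and other special cases are ignored. The function name `find_clusters` is an assumption.

```python
import unicodedata
from typing import List

VIRAMA_CCC = 9  # canonical combining class shared by the Indic viramas


def find_clusters(text: str) -> List[str]:
    """Split text into clusters using only Unicode character properties."""
    clusters: List[str] = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        after_virama = (
            bool(clusters)
            and unicodedata.combining(clusters[-1][-1]) == VIRAMA_CCC
        )
        if clusters and (is_mark or (after_virama and category.startswith("L"))):
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# namaste: NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
print(find_clusters("\u0928\u092e\u0938\u094d\u0924\u0947"))
```

The same function handles Latin text with combining accents, since it consults character properties rather than a script-specific table.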

In IndiX, the call for getextent is routed to and handled by the text-to-glyph machinery in the X11 server. A toolkit, GTK-11, has also been modified, and many applications, such as Mozilla Composer, work correctly.

The third interface implements the mapping to glyphs. A text cluster can be broken down further into subclusters, and each subcluster generates a glyph. Unicode informally defines character subclusters and their repositioning for Indic scripts. If these were defined formally, implementation-dependent choices would be reduced and the upper layers would become more general. Moving such choices upward out of the lower layers, for example the font, makes those lower layers simpler and more generic.

In IndiX, the informal Unicode guidelines helped us to engineer a generic shaping engine for Indic scripts. Characters within a subcluster are not reordered. Feature flag settings for the font machinery are also avoided.
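
The subcluster approach can be sketched as below. This is an assumed, much-reduced model for Devanagari only, not the IndiX shaping engine: a consonant followed by a virama forms one subcluster, every other character is its own subcluster, the pre-base vowel sign I is repositioned as a whole unit, and each subcluster maps to one glyph. The glyph names and the `GLYPHS` table are invented for the example.

```python
from typing import List

VIRAMA = "\u094d"
PRE_BASE_MATRAS = {"\u093f"}  # VOWEL SIGN I is drawn to the left of the base

# Hypothetical glyph table: one entry per subcluster.
GLYPHS = {
    "\u0938\u094d": "sa.half",
    "\u0924": "ta",
    "\u0915": "ka",
    "\u0947": "e.matra",
    "\u093f": "i.matra",
}


def subclusters(cluster: str) -> List[str]:
    """Break a cluster into subclusters; consonant + virama stay together."""
    subs: List[str] = []
    i = 0
    while i < len(cluster):
        if i + 1 < len(cluster) and cluster[i + 1] == VIRAMA:
            subs.append(cluster[i:i + 2])
            i += 2
        else:
            subs.append(cluster[i])
            i += 1
    return subs


def reposition(subs: List[str]) -> List[str]:
    """Move pre-base matras to the front; subclusters move as whole units."""
    pre = [s for s in subs if s in PRE_BASE_MATRAS]
    rest = [s for s in subs if s not in PRE_BASE_MATRAS]
    return pre + rest


def glyphs_for(cluster: str) -> List[str]:
    """One glyph per subcluster, in visual order."""
    return [GLYPHS.get(s, ".notdef") for s in reposition(subclusters(cluster))]


print(glyphs_for("\u0938\u094d\u0924\u0947"))  # SA VIRAMA TA VOWEL SIGN E
print(glyphs_for("\u0915\u093f"))              # KA VOWEL SIGN I
```

Only whole subclusters change position; the characters inside a subcluster keep their stored order, so the glyph lookup needs no reordering logic and no font feature flags.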

Conclusions

Following the spirit of the Unicode standard in software interfaces helps to internationalize and localize a wide range of applications, especially for Indic languages. Infrastructure components, such as fonts and shaping engines, also become simpler.

Who will benefit from attending

Localization engineers, font layout designers, software engineers.

Description of the (business) benefit

A wide range of applications can be localized, and the infrastructure to support multilingual applications can be engineered on sound and simple principles.