A Unified Phonemic Code Based Scheme for Effective Processing of Indian Languages

R.K. Joshi - National Centre for Software Technology

Intended Audience: Software Engineers, Font Developers
Session Level: Intermediate, Advanced

Statement of Purpose:

The primary purpose is to present a unified technique, based on phonemic codes, that can be used to support Indian languages in a variety of applications. The session first presents the complexity of processing Indian languages. It then gives a detailed exposition of the phonemic basis, its ancient historical background, its applicability to Unicode and ISCII, and a unified technique for Indian language processing tasks, with specific examples from a shaping/rendering engine that uses OpenType fonts. Finally, experience from varied applications is briefly discussed.

Brief Description:

The many Indian languages and dialects are written using 9 scripts. While these scripts have been allotted distinct code blocks in the Unicode standard, applications supporting Indian languages are yet to be found on a number of standard platforms. One primary reason could be that rendering, and processing in general, of Indian languages is complex and requires distinctly different techniques. Orthography follows a phonetically driven basis of composing "phonetic units" to form complex glyphs. While the character set is compact, authentic rendering requires a generative mechanism that can produce glyphs corresponding to all possible character sequences. Complex as this may seem, clear rules can be defined based on a canonical treatise by Panini, the ancient grammarian. These rules establish a perfect correspondence between the phonemes constituting a syllable and its graphical form, and such rules can be defined for each of the Indic scripts. Decomposing text using this phonemic basis, followed by phoneme based computations, provides a single unified technique for rendering Indic scripts. In fact, it is well suited to other processing tasks as well, such as sorting, searching, speech synthesis, speech recognition, and transliteration.
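The decomposition step described above can be illustrated with a minimal sketch. It is not the phonemic code scheme presented in this session: it assumes Devanagari input only, derives rough phoneme symbols from Unicode character names as a stand-in for real phonemic codes, and ignores the nukta and language-specific pronunciation rules. The function names to_phonemes and to_syllables are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA: suppresses the inherent vowel
LETTER = "DEVANAGARI LETTER "
VOWEL_SIGN = "DEVANAGARI VOWEL SIGN "
SIGN = "DEVANAGARI SIGN "


def to_phonemes(text):
    """Return (kind, symbol) pairs: 'C' consonant, 'V' vowel, 'M' modifier, '?' other.

    Symbols come from Unicode character names (an assumption of this sketch).
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        name = unicodedata.name(ch, "")
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if "\u0915" <= ch <= "\u0939":  # consonant letters KA..HA
            # Drop the trailing inherent "A" from the name: "KA" -> "k".
            out.append(("C", name[len(LETTER):-1].lower()))
            if nxt == VIRAMA:
                i += 1  # virama: no vowel follows this consonant
            elif nxt and unicodedata.name(nxt, "").startswith(VOWEL_SIGN):
                out.append(("V", unicodedata.name(nxt)[len(VOWEL_SIGN):].lower()))
                i += 1  # explicit vowel sign replaces the inherent vowel
            else:
                out.append(("V", "a"))  # inherent vowel
        elif "\u0905" <= ch <= "\u0914":  # independent vowel letters A..AU
            out.append(("V", name[len(LETTER):].lower()))
        elif name.startswith(SIGN):  # anusvara, visarga, and similar signs
            out.append(("M", name[len(SIGN):].lower()))
        else:
            out.append(("?", ch))
        i += 1
    return out


def to_syllables(phonemes):
    """Group phonemes so that each syllable ends at its vowel."""
    syllables, current = [], []
    for kind, sym in phonemes:
        if kind == "M" and not current and syllables:
            syllables[-1].append(sym)  # modifier joins the preceding syllable
            continue
        current.append(sym)
        if kind == "V":
            syllables.append(current)
            current = []
    if current:
        syllables.append(current)
    return syllables


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # the word "kshatriya"
    print(to_syllables(to_phonemes(word)))
    # prints [['k', 'ss', 'a'], ['t', 'r', 'i'], ['y', 'a']]
```

Each syllable in the result corresponds to one composite written form (a conjunct plus its vowel sign), which is the kind of unit a rule-based shaping step could then map to glyphs. The same phoneme sequence could in principle feed sorting or transliteration, though this sketch does not attempt either.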


The software complexity of supporting Indian languages in different applications can be controlled by using a unified technique based on phonemic codes, which are obtained through a well-defined transformation of the Unicode or ISCII encoding. This is illustrated by experience from actual implementations.

