From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Fri Jun 24 2005 - 14:06:40 CDT
(Draft version)
Tamil Collation vs Transliteration/Transcription Encoding
Though it suffers from numerous implementation problems, Unicode is based on a
highly sophisticated technical architecture. This article examines how Unicode
mishandled Tamil collation and analyses the alternative solutions for attaining
Tamil collation.
Any implementation would initially aim for a natural sort order for a
language, whereby the default hex order of the code points would be the
natural sort order of that language. The question now is why Unicode decided
to deny this natural facility to Tamil in its implementation strategy. The
answer is that, in Unicode's consideration, there was another requirement
judged more important than the sort order of Tamil: the transliteration
properties of the code order of all Indian languages had to be the same, and
sort order was considered a minor matter by comparison. Unicode decided that
writing software to transliterate between different Indic languages is a more
daunting task than writing software to collate a language.
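The gap between code-point order and natural order can be seen directly. A
minimal sketch in Python, using the traditional order of the eighteen core
Tamil consonants (the word list and variable names are illustrative, not from
any standard):

```python
# Traditional Tamil alphabetical order for the 18 core consonants.
tamil_order = ['க', 'ங', 'ச', 'ஞ', 'ட', 'ண', 'த', 'ந', 'ப', 'ம',
               'ய', 'ர', 'ல', 'வ', 'ழ', 'ள', 'ற', 'ன']

# A plain sort compares raw code points (the "default hex order").
codepoint_sorted = sorted(tamil_order)

print(codepoint_sorted == tamil_order)  # False: the two orders disagree
# For example, ன (U+0BA9) sorts between ந and ப by code point,
# but comes last in the traditional Tamil alphabet.
```

So a plain code-point sort of Tamil text cannot, by itself, reproduce Tamil
alphabetical order; a separate collation step is needed.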
However, Devanagari had the upper hand in getting its natural sort order
encoded, while the other languages were forced to abandon their natural sort
order in favour of the transliteration code order. All these other languages
now face the task of implementing fixes to get collation working.
Unlike Latin-based languages, each Indic language uses an alphabet of its
own. For this reason, abandoning natural sort order in favour of
transliteration sort order was not a technical but a political decision by
Unicode. Unicode did understand the damage it did to the affected languages,
but decided to go along with its political decision, forcing minority
languages to obey orders. Software routines for transliteration are a simple
task compared to software routines for collating a scrambled encoding.
Unicode still decided to enforce its political agenda over a technical
requirement.
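To see how simple the transliteration routine is: the Indic blocks were laid
out in parallel (following ISCII), so characters in corresponding positions
differ by a constant offset. A minimal sketch, assuming only this parallel
block layout (the function name is mine):

```python
# Tamil block starts at U+0B80, Devanagari at U+0900.
OFFSET = 0x0B80 - 0x0900

def deva_to_tamil(text):
    # Shift each Devanagari code point into the Tamil block;
    # leave everything else unchanged. No check is made that the
    # target code point is actually assigned.
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            out.append(chr(cp + OFFSET))
        else:
            out.append(ch)
    return ''.join(out)

print(deva_to_tamil('कमल'))  # → 'கமல'
```

A few lines of arithmetic suffice, which is the point: the requirement that
the blocks be parallel bought very little, since the same mapping could have
been done with a small lookup table over any code order.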
The Unicode transliteration scheme does not work. The saddest thing of all is
that the transliteration does not work as Unicode hoped. There never was a
simple transliteration mechanism suitable for encoding different languages.
For example, the Tamil writing system is based on a phonemic alphabet, while
Devanagari assigns a separate letter to each phoneme. In Tamil the single
letter க (k) can stand for k, h, g, x, q or c (mahaL, magan, makkan, quil,
xavier, etc.). In Devanagari, individual glyph shapes represent each of these
phonemes. In Tamil, aspirated and many other sounds are written using a
single modulating indicator called Aytham, yet an unacceptably high number of
the code points allocated for Tamil are left unassigned and unusable because
of this transliteration encoding that never works.
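The holes are easy to demonstrate. A small probe, assuming only the standard
library's `unicodedata` module, shows where the Devanagari velar series lands
in the Tamil block under the parallel-block offset:

```python
import unicodedata

# Tamil block starts at U+0B80, Devanagari at U+0900.
OFFSET = 0x0B80 - 0x0900

# Map the Devanagari velars (ka, kha, ga, gha) into the Tamil block
# and record whether the target code point is assigned.
status = {}
for ch in 'कखगघ':
    target = chr(ord(ch) + OFFSET)
    try:
        status[ch] = unicodedata.name(target)
    except ValueError:          # unassigned code point has no name
        status[ch] = '(unassigned)'

for ch, name in status.items():
    print(ch, '->', name)
# Only क lands on an assigned letter (TAMIL LETTER KA); kha, ga and
# gha map into the unassigned gap U+0B96–U+0B98.
```

The parallel layout thus costs Tamil its natural code order while leaving
gaps in the block wherever Tamil has no separate letter for a Devanagari
phoneme.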
It is important to understand that a superior architecture like Unicode, made
inferior by a misguided political requirement, will not be easy to fix.
Therefore it is very important that we start work on fixing the bug caused by
the transliteration-based encoding so that collation works as required. We
will analyse the collation techniques available to fix the problem caused by
the transliteration-based encoding bug.
To be continued....
This archive was generated by hypermail 2.1.5 : Fri Jun 24 2005 - 14:09:44 CDT