Re: Clarifications on Thamizh Character Set Standardisations

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Mon May 29 2000 - 09:43:28 EDT


Dear Sir,

Padma kumar .R wrote:
>
> I am having some doubts to be clarified regarding the subject of Thamizh
> character set standardisation. I trust that you will reply me when you find
> free time. I am listing my Ideas, suggestions and doubts in the following.
>
> 1. I noticed that the Thamizh character ordering in both of the
> character-sets are not that much proper.

Well, it may be somewhat wrong, but you certainly have to consult the right
place, and it is *not* the obvious one.

The ordering is governed by UTR#10
<URL:http://www.unicode.org/unicode/reports/tr10/>
See <URL:http://www.unicode.org/unicode/reports/tr10/charts/>
for a nice graphical representation (thanks to Mark).
<URL:http://www.unicode.org/unicode/reports/tr10/charts/Collation40.html>
is the relevant page for Tamil (sorry, I write it using the standard
orthography, as many people are not able to grasp தமிழ்).

That said, I agree that the Tamil order is not adequate as it stands
(using Unicode names, it should look something like ka nga ca
 nya tta nna ta na pa ma ya ra la va llla lla rra nnna, and then the
 Grantha characters).

> If we use any of the character
> encoding for computations such as alphabetic sorting (akara varisai) or
> index searching, we may not get the desired result.

This is well known, and indeed it has been acknowledged for a long time
that this is not a goal for Unicode. Only local standards, which have
a much narrower aim, _may_ be able to cover those requirements. Broader
standards are not able to meet them, for multiple reasons.
As a result, for every language, yours and mine included, special
handling is needed to do collation when using Unicode.
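
To illustrate what such special handling can look like, here is a minimal
Python sketch. It is not the UTR#10 algorithm, only a hand-made table: the
consonant order is the one I gave above, and the names TRADITIONAL, RANK
and tamil_key are made up for this example. It covers the 18 consonants
only; vowels, vowel signs and Grantha letters would need their own entries.

```python
import unicodedata

# Traditional order of the 18 Tamil consonants, written as the last word
# of their Unicode names (TAMIL LETTER KA, TAMIL LETTER NGA, ...).
TRADITIONAL = ("KA NGA CA NYA TTA NNA TA NA PA MA YA RA LA VA "
               "LLLA LLA RRA NNNA").split()

LETTERS = [unicodedata.lookup("TAMIL LETTER " + n) for n in TRADITIONAL]
RANK = {ch: i for i, ch in enumerate(LETTERS)}

def tamil_key(word):
    """Hypothetical sort key: listed consonants sort by traditional rank;
    any other character sorts after them, by code point."""
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]

by_code_point = sorted(LETTERS)                  # raw Unicode order
by_tradition = sorted(LETTERS, key=tamil_key)    # table-driven order
```

With raw code points, nnna (U+0BA9) sorts right after na (U+0BA8) and
rra (U+0BB1) before la (U+0BB2); with the key, they fall at the end of
the series as expected.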

> I think it should be like,
> a. Numerals
> b. Vowels & ayutham
> c. Prefix modifiers
> d. Pure consonants (ka, nga, cha...., sa, sha..) (without any
> modifiers)
> e. NNaa, Naa, Raa
> f. Postfix modifiers
> g. Combination modifier symbol (for Ea, Eaa, O, OO, Ow) (not that
> much necessary) (found in ISCII & UNICODE)
> h. Combination Consonants, in alphabetic order
> i. Special characters and punctuations.
>
> If the character sequence is in the above order there will be no problem
> in the computation and word processing point of view.

I think you will still have problems:
- first, in a glyph encoding like TAB, the prefix matras like e ee ai, etc.
have to come before the consonant, but I understand they collate after it in
Tamil. This will certainly lead to problems.
- next, having precomposed clusters like Naa (whatever form of Tamil n it
stands for) is not likely to always work, since it should come between
Na (which is a single "character") and Ni (which is built with two
characters, the first of which is the very one for Na)...
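
The first point can be sketched in Python as follows. This is only an
illustration: I use Unicode code points stored in visual order as a
stand-in for a glyph encoding (these are not TAB's actual code values),
and the names logical, visual and order are made up for the example.

```python
import unicodedata

KA = unicodedata.lookup("TAMIL LETTER KA")
MA = unicodedata.lookup("TAMIL LETTER MA")
SIGN_E = unicodedata.lookup("TAMIL VOWEL SIGN E")

# Logical (phonetic) order: the consonant is stored first, then the sign.
logical = {"ka": KA, "ke": KA + SIGN_E, "ma": MA}

# Visual (glyph) order: the prefix matra is stored before the consonant,
# the way it is drawn on screen.
visual = {"ka": KA, "ke": SIGN_E + KA, "ma": MA}

def order(table):
    """Return the syllable labels sorted by their stored code sequence."""
    return sorted(table, key=lambda label: table[label])

logical_order = order(logical)   # ke stays next to ka
visual_order = order(visual)     # ke is pulled away from ka
```

In the visual table, "ke" no longer sorts next to "ka": it is filed under
the matra, after every plain consonant.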

To solve this problem (and others), ISCII chose to use phonetic encoding,
so that the same character is used for "aa" whether it comes after Na (where
it changes into a ligature) or after ka. This is very different from glyph
encodings like TAB. Unicode follows ISCII in this respect (making it a basis
for its structure, in fact).
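
A small Python sketch of what this means in Unicode terms. I assume here
that "Na" stands for TAMIL LETTER NNA; the variable names are made up for
the example.

```python
import unicodedata

KA = unicodedata.lookup("TAMIL LETTER KA")
NNA = unicodedata.lookup("TAMIL LETTER NNA")   # assumption: the "Na" above
SIGN_AA = unicodedata.lookup("TAMIL VOWEL SIGN AA")
SIGN_I = unicodedata.lookup("TAMIL VOWEL SIGN I")

# The same vowel sign character follows either consonant; any ligature
# is left to the rendering, it is not a separate code.
kaa = KA + SIGN_AA
nnaa = NNA + SIGN_AA

# Because Naa is spelled Na + sign, it falls between Na and Ni even in a
# plain code point sort.
na_series = [NNA + SIGN_I, NNA + SIGN_AA, NNA]
sorted_series = sorted(na_series)
```

Here sorted_series comes out as Na, Naa, Ni, which is the placement a
precomposed Naa code cannot guarantee.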

> I wonder why such an orderly convention is not followed in both of the
> well analysed Standards. I am very much interested to know about this.

For Unicode, this is easy: it closely follows ISCII-88, and the latter was
designed with the requirement of easy transliteration between all the
Indian scripts, rather than suitability for any particular one.
I know that the ISCII order appears Devanagari-biased to you (and I do not want
to comment on this any further), but this is history that we
have to live with. There is *no* option for this to change in the future
(Unicode won't repeat the Hangul mess).

 
> Also I am sending you the Thamizh character order (JPG File) as an
> attachment, that what I have in my mind.

I am sure Mark and Ken, the authors of UTR#10, will study your input.
I am unable to judge by myself how far it departs from my idea of the
order (at first sight, it looks like lla U+0BB3 comes after rra U+0BB1
for you, on row 6, while all my sources have lla U+0BB3 before
rra U+0BB1; for the record, row 20 seems to have the order I know).

Antoine


