Re: Tamil Collation

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 26 2005 - 18:35:01 CDT

Next message: David Starner: "Re: Tamil sha (U+0BB6) - deprecate it?"

Previous message: David Starner: "Re: Tamil Collation vs Transliteration/Transcription Encoding"
In reply to: Sinnathurai Srivas: "Re: Tamil Collation vs Transliteration/Transcription Enc Version2"
Next in thread: James Kass: "Re: Tamil Collation"
Maybe reply: James Kass: "Re: Tamil Collation"
Maybe reply: Richard Wordingham: "Re: Tamil Collation"
Reply: Sinnathurai Srivas: "Tamil Collation - Analysis"

Sinnathurai Srivas wrote:

> Why punishing Tamil for mistakes in Grantham and Unicode?
>
>> 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
>>
>> Note that the sorting algorithm will treat them as identical.
>>
>> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
>
> Tamil can process itself at 16 bit (and 8bit)

This is 16 bit processing! The part of the key for Level 1 comparison gets
0x197B, the part for Level 2 (basically accent comparison) gets 0x0020, the
part for Level 3 (casing etc.) gets 0x0002, and the part for Level 4, which
ensures that canonically inequivalent sequences do not compare equal, gets
0x0BCA.
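
As a rough sketch of how those four parts end up in a sort key, the Python
below parses the two collation entries quoted above and concatenates the
weights level by level, with a zero separator between levels. The names
TABLE, parse_elements and sort_key are made up for illustration, the fourth
weight is simply treated as a final level as described in this message, and
details such as variable weighting are ignored.

```python
import re

# The two entries quoted earlier in this thread (weights are hexadecimal).
TABLE = {
    (0x0BCA,): "[.197B.0020.0002.0BCA]",         # TAMIL VOWEL SIGN O
    (0x0BC6, 0x0BBE): "[.197B.0020.0002.0BCA]",  # VOWEL SIGN E + VOWEL SIGN AA
}

def parse_elements(text):
    """Turn '[.197B.0020.0002.0BCA]...' into a list of 4-tuples of ints."""
    return [tuple(int(w, 16) for w in body.split("."))
            for body in re.findall(r"\[[.*]([0-9A-Fa-f.]+)\]", text)]

def sort_key(elements, levels=4):
    """Concatenate the non-zero weights level by level, separated by 0."""
    key = []
    for level in range(levels):
        if level:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in elements if ce[level])
    return key

key_precomposed = sort_key(parse_elements(TABLE[(0x0BCA,)]))
key_decomposed = sort_key(parse_elements(TABLE[(0x0BC6, 0x0BBE)]))

print([f"{w:04X}" for w in key_precomposed])
print(key_precomposed == key_decomposed)
```

Every unit in the resulting key fits in 16 bits, and the two spellings of
VOWEL SIGN O produce the same key because both table entries carry the same
collation element.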

> Why this punishment by Grantham. ksh forces Tamil to go even the way of 48
> bit way.

It doesn't. The start of the 'ksh' entry is a sequence of 3 scalar values,
those of KA, VIRAMA, SSA. The punishment is actually for sharing a planet
with Europeans - capitals and accents. (You can only blame Thais for tone
marks, which are treated like accents. I'm not sure that Thai tone marks
weren't based on Vedic accents.)
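
To check the count for yourself, the scalar values can be listed with
Python's standard unicodedata module. The variable names are mine; nothing
here goes beyond the three code points named above.

```python
import unicodedata

ksh_start = "\u0B95\u0BCD\u0BB7"  # KA, VIRAMA (pulli), SSA

# Print each scalar value with its Unicode character name.
for ch in ksh_start:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

scalars = [ord(ch) for ch in ksh_start]
# Each scalar value is below 0x10000, so each fits in one 16-bit unit.
fits_16_bits = all(s <= 0xFFFF for s in scalars)
```

The key of the table entry is three 16-bit units long; no single unit is
wider than 16 bits.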

> Please find ways to stop this nonsense.

Did you try to read the Unicode Collation Algorithm?

> Tamil do not need all these unwanted punishment. We are innocent please.
>
> Lets do 16 bit processing. let's stop un-technical canonism.
> Let's stop vastly complex ksh running havoc with Tamil.

>>>> If Tamil sorting can be expressed purely by a sorting order of consonants
>>>> and vowels, then the answer for sorting words is simply to rearrange the
>>>> weights on vowels and letters in the default UCA to accord with this
>>>> ordering.
>>
>>> 99% yes.
>>
>>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham
>>> need to be weighted and that's it.

That's not true, as you should know full well. The usual Indic alphabet
ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA. Tamil
needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?)
Devanagari has added them in a different order to Tamil. The default UCA
orders the consonants in codepoint order, and then to add to the
disagreement Tamil puts the 'Grantha' letters together (so moving JA) and
adds 'ksh'. I believe the basic information may be found in Table 1 at
http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html . Good news
is that the ஸ்ரீ ('shri') ligature is sorted specially, so collation can
reasonably be defined to make the old and new encodings equivalent!

The basic changes needed are to change the weights of the consonants. We
need some extra values - how does one express that in a proposal to change
the default algorithm? For thinking about it, we can use fractional values.

One nasty feature to implement is that consonant plus pulli comes before
plain consonant. The simplest way of capturing this is to change consonant
entries in the weighting table such as that for KA from

0B95 ; [.195C.0020.0002.0B95] # TAMIL LETTER KA

to

0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>

while retaining

0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA

for pulli used inappropriately.

This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO VIRAMA'.
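
A minimal sketch of the trick, looking at primary weights only and using
just the three KA and VIRAMA entries above. The lookup prefers the longest
matching table entry, which is how the two-character entry wins over the
single KA. PRIMARY and primary_key are names invented here, and a real
table would of course cover every consonant.

```python
# Primary weights only; 0x195C and 0x197E are the values quoted above.
PRIMARY = {
    "\u0B95": [0x195C, 0x197E],  # KA -> KA weight + 'no virama' weight
    "\u0B95\u0BCD": [0x195C],    # KA + VIRAMA -> KA weight alone
    "\u0BCD": [0x197E],          # stray VIRAMA keeps its own weight
}
LONGEST = max(len(seq) for seq in PRIMARY)

def primary_key(text):
    """Map text to primary weights, preferring the longest table match."""
    key, i = [], 0
    while i < len(text):
        for n in range(LONGEST, 0, -1):
            chunk = text[i:i + n]
            if chunk in PRIMARY:
                key.extend(PRIMARY[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no table entry for U+{ord(text[i]):04X}")
    return key

words = [
    "\u0B95\u0B95\u0BCD",  # KA, KA+VIRAMA
    "\u0B95",              # KA
    "\u0B95\u0BCD\u0B95",  # KA+VIRAMA, KA
    "\u0B95\u0BCD",        # KA+VIRAMA
]
ordered = sorted(words, key=primary_key)
print(ordered)
```

With these entries everything beginning with KA plus pulli sorts ahead of
everything beginning with plain KA, at the price of one extra weight in the
key for each plain consonant.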

The trick is a tad unpleasant in that it lengthens most sort keys. Another solution
is to have an entirely separate weight for consonant plus pulli, e.g.

0B95 ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>

where H means a half. (I really am hitting notational problems here.
Help!)
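
One way round the notation while thinking about it is to hold the weights
as exact fractions and only turn them into integers at the end. The sketch
below does that for the KA entries with Python's fractions.Fraction. The
renumbering step (rank the distinct values and count up from 1) is only an
assumption about how the gaps might eventually be closed, not a proposal
for actual UCA values.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Fractional primary weights: the 'H' in the entries above becomes + 1/2.
FRACTIONAL = {
    "\u0B95\u0BCD": Fraction(0x195C),   # KA + VIRAMA
    "\u0B95": Fraction(0x195C) + HALF,  # plain KA sorts just after it
    "\u0BCD": Fraction(0x197E),         # stray VIRAMA
}

# Renumber to integers by rank, which preserves the relative order.
distinct = sorted(set(FRACTIONAL.values()))
ranks = {weight: n for n, weight in enumerate(distinct, start=1)}
INTEGER = {seq: ranks[weight] for seq, weight in FRACTIONAL.items()}

for seq, weight in FRACTIONAL.items():
    codes = " ".join(f"{ord(c):04X}" for c in seq)
    print(f"{codes:10} fractional={weight} integer={INTEGER[seq]}")
```

Each table entry then contributes a single primary weight, so the keys stay
as short as they are in the default table.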

There are other details to check, but I hope everyone interested understands
roughly what needs doing.

Richard.


This archive was generated by hypermail 2.1.5 : Sun Jun 26 2005 - 18:36:27 CDT