Re: Tamil Collation

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Jun 27 2005 - 16:34:16 CDT

Next message: Gregg Reynolds: "[Fwd: Re: Tamil Collation vs Transliteration/Transcription Enc Version2]"

Previous message: John Hudson: "Re: Tamil Collation vs Transliteration/Transcription Enc Version2"
Maybe in reply to: Richard Wordingham: "Re: Tamil Collation"
Next in thread: Sinnathurai Srivas: "Tamil Collation - Analysis"

N. Ganesan wrote:

>Pl. see a collation chart for Tamil:
> http://nganesan.thamizamuthu.com/docs/TamilCollationChart.html
> Or, in pdf form:
> http://www.geocities.com/thamizh@sbcglobal.net/TamilCollationChart.pdf

> I'd love to know when will the SHA (u+0bb6) Uniscribe be updated and SHA
> will work in Windows correctly? Fixing Uniscribe to render SHA series in
> Tamil script - is it a job to be done by companies like Microsoft?

Uniscribe belongs to Microsoft, and I haven't heard of anyone offering an
alternative version.

> Like Thai, Tamil also employs in majority, and in a wide class of
> applications (eg., loans from English, the West or Islamic world) "ksh"
> only as non-conjunct. So we at INFITT are discussing a proposal to make
> the non-conjunct KSHA as default, and to create conjugated ksha with ZWJ.
> The majority behaviour of ksha as non-conjunct is in Tamil, but the
> non-conjunct ksha is not known in other Indic scripts. It is a Tamil
> special.

As far as I can make out, and FWIW Uniscribe agrees with me, both ZWJ and
ZWNJ specify the form with visible pulli. Are க்ஷ் and க்‌ஷ் sorted
differently, as your link implies? If so is க்‌ஷ் truly sorted differently
to what one might expect of a mere sequence of க்‌ and ஷ்?

Working from
<http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html>,
I thought I had sorted out the requirement and solution:

1. Tamil standard

Collating order is:

A. ASCII: SP ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
[ \ ] ^ _ { } ~
B. Miscellaneous marks DAY (U+0BF3) to Number sign (U+0BFA)
Current Level 1 weights: *03AD (day) to *03B3 (number sign)

C. Numbers (incl 10, 100 etc)
Current Level 1 weights: 0F62 (0) to 0F6B (9)
but then *0EC9 (10), *0ECA (100), *0ECB (1000)

D. Words:
   Anusvara - current Levels 1 and 2 weights: [0000.0120]
   Aytham - 194F
   Vowel letters - Current Level 1 weights 1950 to 195B
   Consonant letters and vowel signs - in binary order, current Level 1
     weights 195C to 197D
   Pulli - Current Level 1 weight: 197E
   Stray length mark - Current Level 1 weight 197F

Solution Approach:

1. Treatment of ASCII must be reserved to full Tamil customisation.

2. Query ignoring of the miscellaneous marks.

3. Query treatment and ordering of powers of 10. Why are they treated as
variable?
Why sorted before decimal digits if selected as non-ignorable?

4. Words:
   a) Leaving as at present probably does least harm.
   b) Assign weights in the following ascending sequence:
      (i) For each (NFC) vowel letter in binary order U+0B85 to U+0B94.
      (ii) Aytham (U+0B83)
      (iii) For each consonant and ligature KSHA, in order
           KA, NGA, CA, NYA, TTA, NNA, TA, NA, PA, MA, YA, RA, LA, VA
             (Indian Sprachbund sounds, in standard Indic order);
           LLLA, LLA, RRA, NNNA (specifically Dravidian sounds);
           JA, SHA, SSA, SA, HA, KSHA ('Grantha' letters, in standard Indic
             order):
           (A) Consonant plus virama (i.e. visible pulli)
           (B) Consonant
      (iv) SHRI ligature (whether spelt with SSA or SHA - possibly make
           the difference a second-level matter)
      (v) For each (NFC) dependent vowel sign in binary order U+0BBE to
U+0BCC
      (vi) Virama (for irregular spellings only)
      (vii) Tamil AU length mark (for irregular spellings only)

If K-SHA and KSHA are as complicated as implied by
http://nganesan.thamizamuthu.com/docs/TamilCollationChart.html I'll have to
do some thinking. Are the differences at Level 1 or Level 2? It's a shame
that the rendering for the HTML version is broken - the KSHA ligature did
not form! (I'm not totally sold on the idea that Tamil letters are
soft-dotted, that TAMIL VOWEL SIGN A ought to have been an invisible
superscript, and that Tamil vowel signs are all superscript. :) If ZWJ
ought to yield rather than inhibit ligation, the 'contractions' for KSHA
will have to include sequences with ZWJ.
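
The point about ZWJ can be illustrated with a small Python sketch. It
assumes, purely for the sake of the example, that a ligature-requesting ZWJ
would sit between the virama and SSA; the proposal under discussion may place
it elsewhere. `ksha_spellings` and `tokenize` are hypothetical helper names,
and the token "KSHA" just stands in for the single collation element.

```python
KA, SSA, VIRAMA = "\u0b95", "\u0bb7", "\u0bcd"
ZWJ, ZWNJ = "\u200d", "\u200c"

def ksha_spellings(zwj_requests_ligature):
    """Spellings to be contracted to the single KSHA element."""
    spellings = [KA + VIRAMA + SSA]
    if zwj_requests_ligature:
        # Assumed position of ZWJ: after the virama, before SSA.
        spellings.append(KA + VIRAMA + ZWJ + SSA)
    return spellings

def tokenize(text, contractions):
    """Split text into collation elements, longest contraction first."""
    ordered = sorted(contractions, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        hit = next((c for c in ordered if text.startswith(c, i)), None)
        if hit is not None:
            tokens.append("KSHA")
            i += len(hit)
        else:
            tokens.append(text[i])
            i += 1
    return tokens
```

With `ksha_spellings(True)` the ZWJ spelling collapses to one element, while
the ZWNJ spelling still falls apart into KA, virama, ZWNJ and SSA, which is
the behaviour one would want if ZWNJ keeps the pulli visible.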

The next step should be to code up and run a revised set of collation
elements (allkeys.txt), but I don't have a Tamil dictionary to test the
collation against.

I can't decide whether it is right to ignore non-decimal numbers in
collation (until Level 4). That rule seems to apply to all but Greek, Roman
and CJK numbers. I don't know enough about Tamil non-positional number
notation to comment.
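
For what it is worth, the effect of the current weights quoted under C above
can be sketched in a few lines of Python. Only the Level 1 values from this
message are used; pairing 0F62 and 0F6B with U+0BE6 and U+0BEF, and 0EC9 to
0ECB with U+0BF0 to U+0BF2, is my assumption, and `primary_key` is a made-up
name. "shifted" here just means that variable (starred) elements drop out of
the Level 1 key.

```python
CURRENT_L1 = {
    "\u0be6": 0x0F62,  # TAMIL DIGIT ZERO (assumed code point for "0")
    "\u0bef": 0x0F6B,  # TAMIL DIGIT NINE (assumed code point for "9")
    "\u0bf0": 0x0EC9,  # TAMIL NUMBER TEN
    "\u0bf1": 0x0ECA,  # TAMIL NUMBER ONE HUNDRED
    "\u0bf2": 0x0ECB,  # TAMIL NUMBER ONE THOUSAND
}
VARIABLE = {"\u0bf0", "\u0bf1", "\u0bf2"}  # the starred weights

def primary_key(text, alternate):
    """Level 1 key; variable elements are dropped when alternate='shifted'."""
    key = []
    for ch in text:
        if alternate == "shifted" and ch in VARIABLE:
            continue  # ignored at Level 1, would reappear only at Level 4
        key.append(CURRENT_L1[ch])
    return key
```

Under "non-ignorable" the sign for ten gets a key below that of digit zero,
which is the oddity queried in point 3; under "shifted" a string such as ten
followed by nine has the same Level 1 key as nine alone.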

Richard.


This archive was generated by hypermail 2.1.5 : Mon Jun 27 2005 - 17:06:31 CDT