Re: Tamil Collation

From: Richard Wordingham
Date: Mon Jun 27 2005 - 16:34:16 CDT

    N. Ganesan wrote:

    >Pl. see a collation chart for Tamil:
    > Or, in pdf form:

    > I'd love to know when will the SHA (u+0bb6) Uniscribe be updated and SHA
    > will work in Windows correctly? Fixing Uniscribe to render SHA series in
    > Tamil script - is it a job to be done by companies like Microsoft?

    Uniscribe belongs to Microsoft, and I haven't heard of anyone offering an
    alternative version.

    > Like Thai, Tamil also employs in majority, and in a wide class of
    > applications (eg., loans from English, the West or Islamic world) "ksh"
    > only as non-conjunct. So we at INFITT are discussing a proposal to make
    > the non-conjunct KSHA as default, and to create conjugated ksha with ZWJ.
    > The majority behaviour of ksha as non-conjunct is in Tamil, but the
    > non-conjunct ksha is not known in other Indic scripts. It is a Tamil
    > special.

    As far as I can make out, and FWIW Uniscribe agrees with me, both ZWJ and
    ZWNJ specify the form with visible pulli. Are க்ஷ் and க்‌ஷ் sorted
    differently, as your link implies? If so is க்‌ஷ் truly sorted differently
    to what one might expect of a mere sequence of க்‌ and ஷ்?
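
    To make the two spellings in question concrete, here is a minimal Python
    sketch that lists the code points of each. It assumes the second sequence
    differs from the first only by a ZWNJ (U+200C) after the first pulli; the
    helper name describe() is just for illustration.

```python
import unicodedata

def describe(text):
    """Return (U+XXXX, character name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# KA, pulli, SSA, pulli
plain = "\u0b95\u0bcd\u0bb7\u0bcd"
# Same letters, with ZWNJ after the first pulli (assumed spelling)
with_zwnj = "\u0b95\u0bcd\u200c\u0bb7\u0bcd"

for label, s in (("plain", plain), ("with ZWNJ", with_zwnj)):
    print(label)
    for codepoint, name in describe(s):
        print("   ", codepoint, name)
```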

    Working from
    , I thought I had sorted out the requirement and solution:

    1. Tamil standard

    Collating order is:

    A. ASCII: SP ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
    [ \ ] ^ _ { } ~
    B. Miscellaneous marks DAY (U+0BF3) to Number sign (U+0BFA)
       Current Level 1 weights: *03AD (day) to *03B3 (number sign)

    C. Numbers (incl 10, 100 etc)
       Current Level 1 weights: 0F62 (0) to 0F6B (9)
       but then *0EC9 (10), *0ECA (100), *0ECB (1000)

    D. Words:
       Anusvara - current Levels 1 and 2 weights: [0000.0120]
       Aytham - 194F
       Vowel letters - Current Level 1 weights 1950 to 195B
       Consonant letters and vowel signs - in binary order, current Level 1
       weights 195C to 197D
       Pulli - Current Level 1 weight:197E
       Stray length mark - Current Level 1 weight 197F

    Solution Approach:

    1. Treatment of ASCII must be reserved to full Tamil customisation.

    2. Query ignoring of the miscellaneous marks.

    3. Query treatment and ordering of powers of 10. Why are they treated as
       Why sorted before decimal digits if selected as non-ignorable?

    4. Words:
       a) Leaving as at present probably does least harm.
       b) Assign weights in the following ascending sequence:
          (i) For each (NFC) vowel letter in binary order U+0B85 to U+0B94.
          (ii) Aytham (U+0B83)
          (iii) For each consonant and ligature KSHA, in order
               KA, NGA, CA, NYA, TTA, NNA, TA, NA, PA, MA, YA, RA, LA, VA;
               (Indian Sprachbund sounds, in standard Indic order)
               LLLA, LLA, RRA, NNNA; (specifically Dravidian sounds)
               JA, SHA, SSA, SA, HA, KSHA ('Grantha' letters, in standard Indic
               order)
               (A) Consonant plus virama (i.e. visible pulli)
               (B) Consonant
          (iv) SHRI ligature (whether spelt with SSA or SHA - possibly make
               the difference a second level matter)
          (v) For each (NFC) dependent vowel sign in binary order U+0BBE to
          (vi) Virama (for irregular spellings only)
          (vii) Tamil AU length mark (for irregular spellings only)
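
    As a rough illustration of steps (i) to (vii), the following Python sketch
    assigns ascending primary weights in that order and builds a first-level
    sort key by longest-match tokenising. It is not a UCA implementation.
    Assumptions of mine: the vowel sign range ends at U+0BCC; ZWJ and ZWNJ are
    skipped as ignorable but still block a contraction; the SHRI ligature of
    step (iv) is left out because its spelling is not settled here; unknown
    characters simply sort after everything else. All names are illustrative.

```python
import unicodedata

VIRAMA = "\u0bcd"           # pulli
AYTHAM = "\u0b83"
AU_LENGTH_MARK = "\u0bd7"
IGNORABLE = {"\u200c", "\u200d"}   # ZWNJ, ZWJ (assumed ignorable)

def assigned(first, last):
    """Assigned characters in the inclusive code point range first..last."""
    return [chr(c) for c in range(first, last + 1) if unicodedata.name(chr(c), "")]

VOWELS = assigned(0x0B85, 0x0B94)
VOWEL_SIGNS = assigned(0x0BBE, 0x0BCC)   # upper bound is an assumption

# Consonant order from step (iii); KSHA is the sequence KA + pulli + SSA.
CONSONANTS = [chr(c) for c in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3, 0x0BA4, 0x0BA8,
    0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0, 0x0BB2, 0x0BB5,   # KA .. VA
    0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,                   # LLLA, LLA, RRA, NNNA
    0x0B9C, 0x0BB6, 0x0BB7, 0x0BB8, 0x0BB9,           # JA, SHA, SSA, SA, HA
)] + ["\u0b95\u0bcd\u0bb7"]                           # KSHA

def build_elements():
    """List the collation elements in ascending primary-weight order."""
    elements = list(VOWELS)                 # (i)
    elements.append(AYTHAM)                 # (ii)
    for consonant in CONSONANTS:            # (iii)
        elements.append(consonant + VIRAMA) #   (A) consonant plus pulli
        elements.append(consonant)          #   (B) bare consonant
    elements.extend(VOWEL_SIGNS)            # (v)
    elements.append(VIRAMA)                 # (vi)
    elements.append(AU_LENGTH_MARK)         # (vii)
    return elements

ELEMENTS = build_elements()
WEIGHT = {element: rank + 1 for rank, element in enumerate(ELEMENTS)}
MAX_LEN = max(len(element) for element in ELEMENTS)

def sort_key(text):
    """First-level key: list of primary weights, longest match first."""
    text = unicodedata.normalize("NFC", text)
    key, i = [], 0
    while i < len(text):
        if text[i] in IGNORABLE:
            i += 1
            continue
        for n in range(MAX_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in WEIGHT:
                key.append(WEIGHT[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character: place it after all Tamil elements.
            key.append(len(ELEMENTS) + 1 + ord(text[i]))
            i += 1
    return key

words = ["\u0b95\u0bbe", "\u0b95", "\u0b95\u0bcd", "\u0b85", "\u0b83"]
print(sorted(words, key=sort_key))
```

    With these weights a consonant with pulli sorts before the bare consonant,
    which in turn sorts before the same consonant followed by any vowel sign.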

    If K-SHA and KSHA are as complicated as implied by I'll have to
    do some thinking. Are the differences at Level 1 or Level 2? It's a shame
    that the rendering for the HTML version is broken - the KSHA ligature did
    not form! (I'm not totally sold on the idea that Tamil letters are
    soft-dotted, that TAMIL VOWEL SIGN A ought to have been an invisible
    superscript, and that Tamil vowel signs are all superscript. :) If ZWJ
    ought to yield rather than inhibit ligation, the 'contractions' for KSHA
    will have to include sequences with ZWJ.

    The next step should be to code up and run a revised set of collation
    elements (allkeys.txt), but I don't have a Tamil dictionary to test the
    collation against.
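
    For the coding-up step, a small helper along these lines could emit
    entries in the general shape of allkeys.txt lines (code points, a
    semicolon, a bracketed collation element, a comment). This is only a
    sketch: the three-level layout, the default secondary and tertiary values
    0020 and 0002, the base weight and the function name format_entry() are
    all assumptions of mine, not values from a revised table.

```python
import unicodedata

def format_entry(sequence, primary, secondary=0x0020, tertiary=0x0002):
    """Format one allkeys-style line for a character or contraction."""
    codepoints = " ".join(f"{ord(ch):04X}" for ch in sequence)
    names = ", ".join(unicodedata.name(ch, "<unnamed>") for ch in sequence)
    element = f"[.{primary:04X}.{secondary:04X}.{tertiary:04X}]"
    return f"{codepoints} ; {element} # {names}"

BASE = 0x1950   # hypothetical starting primary weight

# A few sample elements, each given the next weight in turn.
samples = [
    "\u0b95\u0bcd",              # KA + pulli
    "\u0b95",                    # KA
    "\u0b95\u0bcd\u0bb7\u0bcd",  # KSHA + pulli
    "\u0b95\u0bcd\u0bb7",        # KSHA
]
for offset, sequence in enumerate(samples):
    print(format_entry(sequence, BASE + offset))
```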

    I can't decide whether it is right to ignore non-decimal numbers in
    collation (until Level 4). That rule seems to apply to all but Greek, Roman
    and CJK numbers. I don't know enough about Tamil non-positional number
    notation to comment.

