Tamil Collation - Analysis

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Jun 28 2005 - 14:33:43 CDT


    The Tamil Nadu state government collation table
    is the sort order we need to achieve (as the primary/default sort order).

    If we do not have to think of the future, and do not have to take account of
    infrequent usage,
    then there is a very simple solution.
    That is:

    first sort Independent vowels (அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ)
    then sort aytham (ஃ)
    then sort pulli (்)
    then sort consonant-a (க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன)
    then sort dependent vowel (ா ி ீ ு ூ ெ ே ை ொ ோ ௌ)
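
As a rough illustration of the five-step order listed above, here is a minimal Python sketch. It is not taken from the Tamil Nadu table; the names ORDER, RANK and simple_key are hypothetical, and it assumes each character is ranked on its own, with anything outside the list sorted last by code point.

```python
# Sketch only: rank each code point by its position in the list above.
VOWELS = "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
AYTHAM = "\u0b83"
PULLI = "\u0bcd"
CONSONANTS = (
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa"  # ka nga ca nya tta nna ta na pa
    "\u0bae\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"  # ma ya ra la va llla lla rra nnna
)
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"

# Vowels, then aytham, then pulli, then consonants, then dependent vowels.
ORDER = VOWELS + AYTHAM + PULLI + CONSONANTS + VOWEL_SIGNS
RANK = {ch: i for i, ch in enumerate(ORDER)}


def simple_key(word):
    """Per-character sort key; unlisted characters go last (an assumption)."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]


KA, AA = "\u0b95", "\u0bbe"
words = [KA + AA, KA + KA, KA + PULLI + KA, KA + PULLI, KA, AYTHAM, "\u0b85"]
print(sorted(words, key=simple_key))
```

With this per-character ranking a bare consonant such as க still sorts ahead of க், because the shorter key is a prefix of the longer one, even though pulli ranks below every consonant.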

    Typical results would be as follows. (If you wish to view them in a text file
    with linear display, please use the aAvarangal font (aAvarangal2 is slightly
    different). One does not need to understand or be concerned about fully
    rendered display. A linear display is more than enough for development
    purposes; it is easy to understand and makes the software easy to test.)

    sample 1

    sample 2

    However, the following need to be considered.
    To be continued ...

    சின்னத்துரை சிறீவாஸ்

    ----- Original Message -----
    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; <unicode@unicode.org>
    Sent: Monday, June 27, 2005 12:35 AM
    Subject: Re: Tamil Collation

    > Sinnathurai Srivas wrote:
    >> Why punish Tamil for mistakes in Grantham and Unicode?
    >>> 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >>> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >>> Note that the sorting algorithm will treat them as identical.
    >>> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
    >> Tamil can process itself at 16 bit (and 8bit)
    > This is 16 bit processing! The part of the key for Level 1 comparison
    > gets 0x197B, the part for Level 2 (basically accent comparison) gets
    > 0x0020, the part for Level 3 (casing etc.) gets 0x0002, and the part for
    > Level 4, which ensures that canonically inequivalent sequences do not
    > compare equal, gets 0x0BCA.
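
To make the four levels concrete, the two table entries quoted above can be read mechanically. The following is a minimal Python sketch, assuming only the bracketed four-field notation shown in this thread; parse_entry and ELEMENT are hypothetical names, not part of any UCA tooling.

```python
import re

# One collation element: [.primary.secondary.tertiary.fourth], all in hex.
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)


def parse_entry(line):
    """Split 'CP [CP ...] ; [elements] # comment' into (code points, elements)."""
    body = line.split("#", 1)[0]          # drop the trailing comment
    cps, elems = body.split(";", 1)       # code points ; collation elements
    code_points = tuple(int(cp, 16) for cp in cps.split())
    elements = [
        tuple(int(field, 16) for field in m.groups())
        for m in ELEMENT.finditer(elems)
    ]
    return code_points, elements


single = parse_entry("0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")
two_part = parse_entry("0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O")

# Each element is (level 1, level 2, level 3, level 4).
print(single[1], two_part[1], single[1] == two_part[1])
```

Both entries carry the same element, which is why the one-code-point and two-code-point spellings of the vowel sign compare as identical.
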
    >> Why this punishment by Grantham? ksh forces Tamil to go even the
    >> 48 bit way.
    > It doesn't. The start of the 'ksh' entry is a sequence of 3 scalar values,
    > those of KA, VIRAMA, SSA. The punishment is actually for sharing a
    > planet with Europeans - capitals and accents. (You can only blame Thais
    > for tone marks, which are treated like accents. I'm not sure that Thai
    > tone marks weren't based on Vedic accents.)
    >> Please find ways to stop this nonsense.
    > Did you try to read the Unicode Collation Algorithm?
    >> Tamil does not need all this unwanted punishment. We are innocent, please.
    >> Let's do 16 bit processing. Let's stop un-technical canonism.
    >> Let's stop vastly complex ksh running havoc with Tamil.
    >>>>> If Tamil sorting can be expressed purely by a sorting order of
    >>>>> consonants and vowels, then the answer for sorting words is simply to
    >>>>> rearrange the weights on vowels and letters in the default UCA to
    >>>>> accord with this ordering.
    >>>> 99% yes.
    >>>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham
    >>>> need to be weighted and that's it.
    > That's not true, as you should know full well. The usual Indic alphabet
    > ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA. Tamil
    > needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?)
    > Devanagari has added them in a different order to Tamil. The default UCA
    > orders the consonants in codepoint order, and then to add to the
    > disagreement Tamil puts the 'Grantha' letters together (so moving JA) and
    > adds 'ksh'. I believe the basic information may be found in Table 1 at
    > http://www.infitt.org/minmanjari/issue2_2/mm-unicodetngovt.html . Good
    > news is that the ஸ்ரீ ('shri')
    > ligature is sorted specially, so collation can reasonably be defined to
    > make the old and new encodings equivalent!
    > The basic changes needed are to change the weights of the consonants. We
    > need some extra values - how does one express that in a proposal to change
    > the default algorithm? For thinking about it, we can use fractional
    > values.
    > One nasty feature to implement is that consonant plus pulli comes before
    > plain consonant. The simplest way of capturing this is to change
    > consonant entries in the weighting table such as that for KA from
    > 0B95 ; [.195C.0020.0002.0B95] # TAMIL LETTER KA
    > to
    > 0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
    > 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
    > while retaining
    > 0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA
    > for pulli used inappropriately.
    > This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO
    > VIRAMA'.
    > It's a tad unpleasant in that it lengthens most sort keys. Another
    > solution is to have an entirely separate weight for consonant plus pulli,
    > e.g.
    > 0B95 ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
    > 0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>
    > where H means a half. (I really am hitting notational problems here.
    > Help!)
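
The two alternatives Richard describes can be compared with a small Python sketch. It looks only at Level 1 weights, uses only the weights quoted in this thread (KA 195C, VIRAMA 197E, VOWEL SIGN O 197B), and represents the "half" with a Fraction purely for illustration. TABLE_A, TABLE_B and primary_key are hypothetical names, and the longest-match lookup is a simplification of real contraction handling.

```python
from fractions import Fraction

KA, PULLI, SIGN_O = "\u0b95", "\u0bcd", "\u0bca"
HALF = Fraction(1, 2)

# Variant A: bare KA carries an extra 'no virama' weight; KA + pulli does not.
TABLE_A = {
    KA: [0x195C, 0x197E],
    KA + PULLI: [0x195C],
    PULLI: [0x197E],          # pulli used inappropriately
    SIGN_O: [0x197B],
}

# Variant B: bare KA gets a fractional weight just above KA + pulli.
TABLE_B = {
    KA: [0x195C + HALF],
    KA + PULLI: [Fraction(0x195C)],
    PULLI: [Fraction(0x197E)],
    SIGN_O: [Fraction(0x197B)],
}


def primary_key(text, table):
    """Level 1 key only, matching the longest table entry at each position."""
    key, i = [], 0
    longest = max(len(entry) for entry in table)
    while i < len(text):
        for n in range(longest, 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return key


words = [KA + SIGN_O, KA + KA, KA, KA + PULLI + KA, KA + PULLI]
order_a = sorted(words, key=lambda w: primary_key(w, TABLE_A))
order_b = sorted(words, key=lambda w: primary_key(w, TABLE_B))
length_a = sum(len(primary_key(w, TABLE_A)) for w in words)
length_b = sum(len(primary_key(w, TABLE_B)) for w in words)
print(order_a == order_b, length_a, length_b)
```

For these five sample strings both tables put consonant plus pulli before the plain consonant and give the same order, while variant A produces the longer keys, which is the cost mentioned above. The sketch does not show that the two variants agree on every input.
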
    > There are other details to check, but I hope everyone interested
    > understands roughly what needs doing.
    > Richard.
