Re: Tamil Collation

From: Richard Wordingham
Date: Sun Jun 26 2005 - 18:35:01 CDT


    Sinnathurai Srivas wrote:

    > Why punishing Tamil for mistakes in Grantham and Unicode?
    >> 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >> 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >> Note that the sorting algorithm will treat them as identical.
    >> A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
    > Tamil can process itself at 16 bit (and 8bit)

    This is 16 bit processing! The part of the key for Level 1 comparison gets
    0x197B, the part for Level 2 (basically accent comparison) gets 0x0020, the
    part for Level 3 (casing etc.) gets 0x0002, and the part for Level 4, which
    ensures that canonically inequivalent sequences do not compare equal, gets
    0x0BCA.
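
    As a rough illustration of how one table entry feeds the four levels,
    here is a minimal sketch. The function name sort_key and the zero level
    separator are assumptions made for the example; it ignores variable
    weighting and other details of the real algorithm.

```python
# One collation element from the quoted table entry
# [.197B.0020.0002.0BCA], as (level 1, level 2, level 3, level 4).
VOWEL_SIGN_O = (0x197B, 0x0020, 0x0002, 0x0BCA)

def sort_key(elements, levels=4):
    """Build a simplified sort key: all level-1 weights, then all
    level-2 weights, and so on, with a zero separator between levels."""
    key = []
    for level in range(levels):
        if level:
            key.append(0x0000)  # assumed level separator
        # Zero weights are ignorable at that level and are skipped.
        key.extend(e[level] for e in elements if e[level] != 0)
    return tuple(key)

key = sort_key([VOWEL_SIGN_O])
print(" ".join(f"{w:04X}" for w in key))
```

    Every value in the resulting key fits in 16 bits.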

    > Why this punishment by Grantham. ksh forces Tamil to go even the way of 48
    > bit way.

    It doesn't. The start of the 'ksh' entry is a sequence of 3 scalar values,
    those of KA, VIRAMA, SSA. The punishment is actually for sharing a planet
    with Europeans - capitals and accents. (You can only blame Thais for tone
    marks, which are treated like accents. I'm not sure that Thai tone marks
    weren't based on Vedic accents.)
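
    To show what an entry keyed on three scalar values means in practice,
    here is a small sketch of longest-match lookup. The table maps to labels
    rather than weights, because the real weights for 'ksh' are not given
    here; the names TABLE and tokenize are hypothetical, and discontiguous
    matching is left out.

```python
# Keys are sequences of code points; values are just labels for the sketch.
TABLE = {
    (0x0B95,): "KA",
    (0x0BCD,): "VIRAMA",
    (0x0BB7,): "SSA",
    (0x0B95, 0x0BCD, 0x0BB7): "KSH",  # contraction: KA, VIRAMA, SSA
}
MAX_LEN = max(len(k) for k in TABLE)

def tokenize(text):
    """Split text into table entries, preferring the longest match."""
    cps = [ord(c) for c in text]
    out, i = [], 0
    while i < len(cps):
        for n in range(min(MAX_LEN, len(cps) - i), 0, -1):
            key = tuple(cps[i:i + n])
            if key in TABLE:
                out.append(TABLE[key])
                i += n
                break
        else:
            out.append(f"U+{cps[i]:04X}")  # not in this toy table
            i += 1
    return out

print(tokenize("\u0B95\u0BCD\u0BB7"))  # the 'ksh' sequence
print(tokenize("\u0B95\u0BCD"))        # KA followed by pulli only
```

    The contraction is one table entry, but each of its code points is still
    an ordinary 16-bit value.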

    > Please find ways to stop this nonsense.

    Did you try to read the Unicode Collation Algorithm?

    > Tamil do not need all these unwanted punishment. We are innocent please.
    > Lets do 16 bit processing. let's stop un-technical canonism.
    > Let's stop vastly complex ksh running havoc with Tamil.

    >>>> If Tamil sorting can be expressed purely by a sorting order of
    >>>> consonants
    >>>> and vowels, then the answer for sorting words is simply to rearrange
    >>>> the
    >>>> weights on vowels and letters in the default UCA to accord with this
    >>>> ordering.
    >>> 99% yes.
    >>> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham
    >>> need to be weighted and that's it.

    That's not true, as you should know full well. The usual Indic alphabet
    ends, gathering bits and pieces, YA, RA, LA, VA, SHA, SSA, SA, HA. Tamil
    needed to add NNNA, RRA, LLA and LLLA, and unfortunately modern(?)
    Devanagari has added them in a different order to Tamil. The default UCA
    orders the consonants in codepoint order, and then to add to the
    disagreement Tamil puts the 'Grantha' letters together (so moving JA) and
    adds 'ksh'. I believe the basic information may be found in Table 1 at .
    The good news is that the ஸ்ரீ ('shri') ligature is sorted specially, so
    collation can reasonably be defined to make the old and new encodings
    equivalent!

    The basic changes needed are to change the weights of the consonants. We
    need some extra values - how does one express that in a proposal to change
    the default algorithm? For thinking about it, we can use fractional values.

    One nasty feature to implement is that consonant plus pulli comes before
    plain consonant. The simplest way of capturing this is to change consonant
    entries in the weighting table such as that for KA from

    0B95 ; [.195C.0020.0002.0B95] # TAMIL LETTER KA

    to

    0B95 ; [.195C.0020.0002.0B95][.197E.0020.0002.0BCD] # TAMIL LETTER KA
    0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>

    while retaining

    0BCD ; [.197E.0020.0002.0BCD] # TAMIL SIGN VIRAMA

    for pulli used inappropriately.

    This trick effectively replaces TAMIL SIGN VIRAMA by 'TAMIL SIGN NO VIRAMA'.
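
    A minimal sketch of the effect on Level 1 ordering follows. It uses only
    the primary weights quoted in this thread (195C for KA, 197E for the
    pulli, 197B for vowel sign O); the names DEFAULT, TAILORED and
    primary_key are hypothetical, and the lower levels are ignored.

```python
# Primary weights only, keyed by code point sequence.
DEFAULT = {
    (0x0B95,): [0x195C],  # KA
    (0x0BCD,): [0x197E],  # VIRAMA (pulli)
    (0x0BCA,): [0x197B],  # VOWEL SIGN O
}

# Tailoring: plain KA carries the extra weight, KA + pulli does not.
TAILORED = dict(DEFAULT)
TAILORED[(0x0B95,)] = [0x195C, 0x197E]
TAILORED[(0x0B95, 0x0BCD)] = [0x195C]

def primary_key(text, table):
    """Level 1 key using longest-match lookup (at most 2 code points)."""
    cps = [ord(c) for c in text]
    key, i = [], 0
    while i < len(cps):
        for n in (2, 1):
            k = tuple(cps[i:i + n])
            if k in table:
                key.extend(table[k])
                i += len(k)
                break
        else:
            raise KeyError(f"U+{cps[i]:04X} not in table")
    return tuple(key)

words = ["\u0B95", "\u0B95\u0BCD", "\u0B95\u0BCA"]  # KA, KA+pulli, KA+O
print(sorted(words, key=lambda w: primary_key(w, DEFAULT)))
print(sorted(words, key=lambda w: primary_key(w, TAILORED)))
```

    With the default weights KA + pulli sorts last of the three; with the
    changed entries it sorts first, at the cost of a longer key for plain KA.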

    It's a tad unpleasant in that it lengthens most sort keys. Another solution
    is to have an entirely separate weight for consonant plus pulli, e.g.

    0B95 ; [.195CH.0020.0002.0B95] # TAMIL LETTER KA
    0B95 0BCD ; [.195C.0020.0002.0B95] # <TAMIL LETTER KA, TAMIL SIGN VIRAMA>

    where H means a half. (I really am hitting notational problems here.)
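
    One way to think about the fractional notation is to hold the weights as
    exact fractions and renumber them to integers afterwards. This is only a
    sketch: the entry labelled NEXT and its weight 195D are hypothetical
    stand-ins for whatever consonant follows, and renumber is an invented
    helper, not part of any existing tool.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Fractional primary weights: "195CH" is read as 0x195C plus a half.
weights = {
    "KA+PULLI": Fraction(0x195C),
    "KA": Fraction(0x195C) + HALF,
    "NEXT": Fraction(0x195D),  # hypothetical following consonant
}

def renumber(fractional, base):
    """Map fractional weights to consecutive integers, keeping their order."""
    ranks = {v: i for i, v in enumerate(sorted(set(fractional.values())))}
    return {name: base + ranks[v] for name, v in fractional.items()}

for name, w in renumber(weights, 0x195C).items():
    print(f"{name}: {w:04X}")
```

    After renumbering, each consonant simply occupies two adjacent integer
    weights, one for the form with pulli and one for the plain form.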

    There are other details to check, but I hope everyone interested understands
    roughly what needs doing.

