Re: Tamil Collation vs Transliteration/Transcription Enc

From: Sinnathurai Srivas (
Date: Sat Jun 25 2005 - 05:46:07 CDT


    ----- Original Message -----
    From: "Michael Everson" <>
    To: "Sinnathurai Srivas" <>
    Cc: "Unicode List" <>
    Sent: Saturday, June 25, 2005 10:10 AM
    Subject: Re: Tamil Collation vs Transliteration/Transcription Enc

    > At 20:06 +0100 2005-06-24, Sinnathurai Srivas wrote:
    >>Any implementation would initially attempt for a natural sort order for a
    >>language, where by the default hex order of codes would be a natural sort
    >>order of that language.
    > This is not how modern sorting of the Unicode Standard works. Except for
    > very simple scripts like Cherokee or Phoenician, hex order can rarely be
    > considered to work -- and it doesn't work ANYWAY the instant you mix
    > European digits or punctuation with them.
    Exactly. Because Unicode chose a transliteration-based encoding, it now
    has to base its collation scheme on something. So I agree that Unicode
    has one or two schemes to handle collation. I ran a sort in MS Excel a long
    time ago, and it worked once I placed the alphabet in that default sort
    order (using a newly encoded, sort-based font). Indeed, Arabic digits and
    punctuation would need some software support.
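
The difference between raw code point ("hex") order and a table-driven order can be sketched in a few lines. This is only an illustration, not the Unicode Collation Algorithm: the names `TRADITIONAL`, `RANK` and `traditional_key` are hypothetical, the table covers only the 12 vowels, aytham and the 18 consonants, and vowel signs, pulli and Grantha letters are ignored.

```python
# Traditional Tamil letter order: 12 vowels, aytham, then the 18 consonants.
# Given as code points so the listing does not depend on font support.
TRADITIONAL = [
    0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,  # a aa i ii u uu
    0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94,  # e ee ai o oo au
    0x0B83,                                          # aytham
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,  # ka nga ca nya tta nna
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,  # ta na pa ma ya ra
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,  # la va llla lla rra nnna
]

# Hypothetical rank table: letter -> position in the traditional order.
RANK = {chr(cp): i for i, cp in enumerate(TRADITIONAL)}


def traditional_key(word):
    """Sort key by traditional letter rank; unlisted characters sort last."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


letters = [chr(0x0BA9), chr(0x0BB1), chr(0x0BB2)]    # nnna, rra, la
by_code_point = sorted(letters)                      # raw code point order
by_tradition = sorted(letters, key=traditional_key)  # table-driven order
```

Under these assumptions, `by_code_point` leaves the three letters as nnna, rra, la, while `by_tradition` gives la, rra, nnna. The position of a letter in the code chart and its position in the sort order are two separate tables.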

    >>The question now is why Unicode decided to deny this natural facility to
    >>Tamil, in its implementation strategy.
    > You assume (1) that there is something "natural" to be served, and (2,
    > again) that Tamil is "broken" somehow in Unicode, which it is not.
    See above: the Excel experiment.

    >>The answer is, in Unicode's consideration there is another requirement
    >>that was considered more important than sorting order of Tamil. The
    >>requirement was, the transliteration properties of code order of all
    >>Indian languages must be the same and sort order was considered a minute
    >>matter in comparison to sort order.
    > It is the case that the Indic blocks (for the major scripts) have
    > one-to-one positional equivalences. This was unnecessary, and wasteful of
    > space -- but it was inherited from ISCII, so you can go and blame them if
    > you don't like it. Having said that, even though it was unnecessary and
    > wasteful of space, it was in no way harmful to any of the Indic scripts.
    I can at least talk to Unicode; I do not think ISCII would have such
    traditions. Anyway, we are not talking about the ISCII code; we are talking
    about Unicode.
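
The "one-to-one positional equivalences" mentioned above can be checked with a fixed offset between block starts. This is a minimal sketch, assuming the Devanagari block starts at U+0900 and the Tamil block at U+0B80; the function name `tamil_counterpart` is hypothetical, and the offset mapping illustrates the chart layout only. It is not a transliteration scheme defined by the Unicode Standard.

```python
import unicodedata

DEVANAGARI_BASE = 0x0900  # start of the Devanagari block
TAMIL_BASE = 0x0B80       # start of the Tamil block


def tamil_counterpart(dev_char):
    """Return the Tamil code point at the same block position, and its name.

    The name is None when that position is unassigned in the Tamil block.
    """
    cp = ord(dev_char) - DEVANAGARI_BASE + TAMIL_BASE
    return cp, unicodedata.name(chr(cp), None)


ka = tamil_counterpart("\u0915")   # DEVANAGARI LETTER KA
kha = tamil_counterpart("\u0916")  # DEVANAGARI LETTER KHA
```

Here `ka` lands on U+0B95 TAMIL LETTER KA, while `kha` lands on U+0B96, which has no assigned character. Those unassigned positions are the "empty spaces" the thread refers to.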

    >>Unicode decided that writing softwares to transliterate between different
    >>Indic languages is a more daunting task than writing software to collate a
    > ISO/IEC 14651 and the Unicode Collation Algorithm can sort anything
    > correctly, so long as the sort is algorithmic.

    Of course, as we did not choose the natural sort order, a fix, like a bug
    fix, needs to be devised to do the sorting.

    >>However, Devanagari had it's upper hand in getting it natural sort order
    > This is inappropriate rhetoric. Devanagari is not a godlike force looking
    > for superiority over Tamil, Redjang, Tibetan, and Lepcha.
    Why a transliteration-based encoding, when it does not work? Do you have
    a plan to change these languages to toe the Devanagari line, some time in
    the distant future?

    >>while other languages were forced to abandon the natural sort order in
    >>favour of transliteration code order.
    > Not only is this unsubstantiated, but it is untrue.
    Devanagari is encoded in sort-order form, while the others are encoded in
    transliteration form, which does not work.

    >>All these other languages now face the task of implementing fixes to get
    >>the collation working.
    > Languages face no tasks. Implementors of the Unicode Collation Algorithm
    > and ISO/IEC 14651 have to tailor those standards to meet their needs.
    Even after 30 years of existence, implementors could not get the sorting to
    work in Unicode. The natural sort order, which never needed any significant
    input from developers, was abandoned in favour of a transliteration-based
    sort order, which never works. We now have sort order not working, at least
    for the foreseeable future.

    >>Unlike Latin based languages, each Indic languages use alphabet of their
    >>own. For this reason abandoning natural sort order in favour of
    >>transliteration sort order was not a technical but a political decision by
    > Nonsense. (1) The order of the characters in a code table is irrelevant
    > with regard to sorting, and (2) the order of the characters in the Tamil
    > code table follows ISCII.

    Why then a transliteration-based encoding?
    Why not a natural-sorting-based encoding? Developers would not need to do
    any significant work to get sorting working with natural sorting.

    >>Unicode did understand the damage it made to the suffering languages, but
    >>decided to go along with it's political decision, forcing minority
    >>languages to obey orders.
    > This allegation is outrageous and entirely untrue.
    Unicode is a transliteration-based encoding, except for Devanagari; still,
    there is no resemblance in character shapes between languages, which all
    needed their own code points.
    Do you see what sounds outrageous?

    >>Software routines to do transliteration is a simple task, compared to
    >>software routines to collate a scrambled encoding. Unicode still decided
    >>to enforce its political agenda over a technical requirement.
    > Not a thing you have said makes any sense whatsoever -- it is you, sir,
    > who are being political. Nothing you are saying here makes either
    > techncial or linguistic sense. Please read the Unicode Standard.
    Transliteration-based encoding makes sense.
    Natural-sort-order-based encoding for Devanagari makes sense.
    Transliteration based encoding never works for what it is intended for, make
    Who rules who makes sense, and that is political.
    Abandoning the natural sort order in favour of toeing a Devanagari line that
    never works is a political decision. Because I say that is political, are
    you correct in saying I'm political?

    >>Unicode transliteration scheme does not work.
    > Unicode has no transliteration scheme. Your belief that it does because
    > ISCII had a particular structure in its code tables is mistaken.
    Yes, it does. The Unicode encoding is based on a transliteration scheme, and
    I do not think you would ask me to spend any more time on this obvious fact.

    >>The saddest thing of all is that the transliteration does not work as
    >>Unicode hoped it. There never was a simple transliteration mechanism
    >>suitable for encoding different languages. For example, Tamil writing
    >>system is based on phonemic based Alphabet system, while Devanagari is
    >>based on phonemic only system.
    > These terms ("phonemic based alphabet system" and "phonemic only system")
    > are non-technical and inaccurate with regard to the structure of the
    > writing systems.
    Well, it is largely accurate.

    >>In Tamil k = k, h, g, x, q, c (mahaL, magan, makkan, quil, xavier, etc..).
    >>In Devanagari individual glyph shapes represent each of these phonemes.
    > Tamil has complex reading rules because it lost original Brahmic letters.
    > So what?
    No sir, Brahmic was not there when the initial Tamil Grammars were there.
    I do not think you know about Tamil Grammar. The alphabet structure has its
    Grammar rules. I do not think anyone in Unicode will ever want to read
    about even the rules on alphabets, yet they are dying to make alphabet-based
    and other Grammar-based decisions for Tamil. I know we are powerless to stop
    that. At least you could have some consideration for our traditions,
    instead of trying to change something when you do not know what it is.

    >>In Tamil aspirated and many other sounds are written using a single
    >>modulating indicator called Aytham, yet an unacceptably high number of
    >>code points allocated for Tamil is deprecated and made unusable because of
    >>this transliteration encoding that never works.
    > If you mean the empty spaces are wasteful, yes they are. They are however
    > not harmful.

    >>It is important to understand that a superior architecture like Unicode,
    >>made inferior by misguided political requirement is not going to be an
    >>easy task to resolve.
    > Gosh, Tamil seems to be implemented here on my Macintosh running OS X.
    > Looks like someone has solved it anyway.
    Are you talking about collation on the Mac?

    >>There fore it is very important that we start work on fixing the bug
    >>caused by transliteration based encoding to do the collation as required.
    >>We will analyse the collation techniques available to fix the problem
    >>caused by transliteration based encoding bug.
    > You are not going to get anywhere as long as you are stuck on this idea
    > that Unicode has anything to do with transliteration.
    Well, understanding what you have got is the best way to resolve what you
    need.

    It is a transliteration-based encoding that we have. Yes, it is.

    >>To be continued....
    > Write a UTN on Tamil sorting if you think that is necessary. Such a
    > document would be useful, perhaps. But your apparent political agenda
    > about the superiority and uniqueness of Tamil is tiresome at best.
    > --


    It is not superiority that I claim; I claim it is sophisticated. There are
    other languages that are very sophisticated in their Grammar too. Tamil has
    a sophisticated Grammar, which is probably the oldest written, yet still
    surviving, Grammar in the world. It is sophisticated; you need to understand
    it before trying to change it for your needs. I never claimed it is superior.
    > Michael Everson * * Everson Typography * *

    Sinnathurai Srivas

    This archive was generated by hypermail 2.1.5 : Sat Jun 25 2005 - 10:58:18 CDT