Re: Tamil Collation vs Transliteration/Transcription Enc

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Fri Jun 24 2005 - 18:21:45 CDT

    I'll answer the other points in a later mail.

    However, it is important that we do not add any Grantha letters to Tamil.
    For example, we are considering deprecating the new 0bb6 sh addition.

    As I wrote before, Tamil uses a phonemic-based alphabet, whereas Grantha uses
    a phonemic-only writing system. Adding Grantha letters will mess up the
    language. Tamil does not fall into the same category as most Indic languages
    in this respect.

    Tamil is an ancient language backed by a sophisticated grammar (probably the
    oldest grammar in the world). Grantha does not base its writing system on
    this grammar. It would be unacceptable to add any more Grantha to Tamil.

    Unicode encoded 0bb6 without proper consultation, again offending the
    language in a serious way.

    As for aspirated, I mean dh = d + aytham, kh = k + aytham.
    For example, the unused code point 0b96 = 0b95 + 0b83: கஃ (KH).
    Tamil also defines 0b83 + 0b95, ஃக (HK), which is not found in Grantha.
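
    The two sequences above can be checked mechanically. The following is a
    minimal sketch using Python's standard unicodedata module; the helper name
    describe is hypothetical. Note that Unicode's formal name for the aytham
    at 0b83 is TAMIL SIGN VISARGA, and that 0b96 has no name because it is
    unassigned.

```python
import unicodedata

KA = "\u0B95"      # TAMIL LETTER KA
AYTHAM = "\u0B83"  # aytham; formal Unicode name is TAMIL SIGN VISARGA


def describe(text):
    """Return (code point, character name) pairs; name is None if unassigned."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, None)) for ch in text]


kh = KA + AYTHAM   # the KH sequence: 0b95 followed by 0b83
hk = AYTHAM + KA   # the HK sequence: 0b83 followed by 0b95

print(kh, describe(kh))
print(hk, describe(hk))

# 0b96 is unassigned in the Tamil block, so it has no character name.
print(unicodedata.name("\u0B96", None))
```

    Both sequences use only assigned characters, so no new code point is
    needed to write them.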

    In any case, it has been a long and hard struggle to keep Grantha out.
    Of course, Grantha has won over almost all of the Indic languages. Tamils
    wish to keep the script the way it is defined in the ancient and
    sophisticated grammar. I do not think Unicode, with its power, would once
    again start to destroy this.

    I was talking to someone about adding the special symbol "th" to the English
    encoding, and he said Unicode does not have the power to arbitrarily change
    English, even if we provide ample evidence of usage in English. But he also
    said Unicode has the power to change Tamil and other languages, as there is
    no significant power that can stop Unicode if it decides to do so.

    I hope things do not go that far, and hope Unicode will help to deprecate
    0bb6, as this addition is not necessary for Tamil. It is not justified to
    attack powerless people with great and classical traditions merely because
    one has power.

    Sinnathurai Srivas

    ----- Original Message -----
    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    To: <unicode@unicode.org>
    Sent: Friday, June 24, 2005 11:52 PM
    Subject: Re: Tamil Collation vs Transliteration/Transcription Enc

    > In reply to Sinnathurai Srivas, I'll deal with the least antagonising
    > point first:
    >
    >> It is very important that we start work on fixing the bug caused by
    >> transliteration based encoding to do the collation as required. We will
    >> analyse the collation techniques available to fix the problem caused by
    >> transliteration based encoding bug.
    >
    > Have you read and understood the Unicode Collation Algorithm (
    > http://www.unicode.org/reports/tr10/ )? What you most importantly need to
    > propose is a re-ordering of the weights (basically
    > 195C.0020.0002 to 1972.0020.0002) assigned to the Tamil consonants (U+0B95
    > to U+0BB9) in http://www.unicode.org/Public/UCA/latest/allkeys.txt
    > (currently Version 4.1.0). If you can demonstrate that your proposed
    > weights give the correct order, I don't see why the change shouldn't be
    > accepted. If you can fix any other collation 'errors' at the same time, I
    > think so much the better.
    >
    > There is no explicit undertaking that the default Unicode Collation
    > Algorithm is correct for any language, but I am not aware of any reason
    > that it would be wrong to make it work properly for the collation of items
    > in the Tamil script. I can't believe collating Sanskrit correctly in the
    > Tamil script is an important consideration.
    >
    > It might be worth advising what to do if any more Grantha letters are
    > restored to the Tamil block, e.g. as SHA U+0BB6 was at 4.1.0. I for one
    > would not be surprised if there were antiquarian fonts that added the
    > entire Grantha script to Tamil, using the obvious code points. [Would
    > Uniscribe provide OpenType support for such an 'extended-Unicode' (i.e.
    > strictly speaking, not Unicode) font? Does FreeType?]
    >
    >> Though it undergoes numerous implementation problems, Unicode is based
    >> on a highly sophisticated technical architecture. This article examines
    >> how Unicode mishandled Tamil collation and analyses the alternative
    >> solutions to attain Tamil collation.
    >
    >> Any implementation would initially attempt a natural sort order for a
    >> language, whereby the default hex order of codes would be the natural
    >> sort order of that language. The question now is why Unicode decided to
    >> deny this natural facility to Tamil in its implementation strategy. The
    >> answer is that, in Unicode's consideration, there was another requirement
    >> more important than the sort order of Tamil. The requirement was that the
    >> transliteration properties of the code order of all Indian languages must
    >> be the same, and sort order was considered a minute matter in comparison.
    >> Unicode decided that writing software to transliterate between different
    >> Indic languages is a more daunting task than writing software to collate
    >> a language.
    >
    >> Unlike Latin-based languages, each Indic language uses an alphabet of its
    >> own. For this reason, abandoning natural sort order in favour of
    >> transliteration sort order was not a technical but a political decision
    >> by Unicode. <snip> Writing software routines to do transliteration is a
    >> simple task compared to writing software routines to collate a scrambled
    >> encoding.
    >
    > You could argue that the Indian ones are almost treated as though they
    > shared a common alphabet. They therefore suffer the same way as
    > Latin-script languages do - the collation of letters beyond the basic 26
    > is similarly messy, and life gets even more complicated with languages
    > that insist on treating digraphs as independent letters (e.g. at least CH,
    > PH, TH, LL, NG and DD in Welsh). All Latin-script languages suffer from
    > the fact that 'B' comes before 'a' in binary order, while for human use
    > they are much better sorted the other way round.
    >
    >> The Unicode transliteration scheme does not work. The saddest thing of
    >> all is that the transliteration does not work as Unicode hoped it would.
    >> There never was a simple transliteration mechanism suitable for encoding
    >> different languages. For example, the Tamil writing system is based on a
    >> phonemic-based alphabet system, while Devanagari is based on a
    >> phonemic-only system. In Tamil k = k, h, g, x, q, c (mahaL, magan, makkan,
    >> quil, xavier, etc.).
    >
    > Surely the point here is that one attempts to write modern Tamil as though
    > it had the same phonology as Classical Tamil. Is it safe to claim
    > aspirated (as opposed to fricative) phonemes for Tamil? Claiming voicing
    > contrasts brings down a stream of invective.
    >
    >> In Devanagari individual glyph shapes represent each of these phonemes.
    >> In Tamil, aspirated and many other sounds are written using a single
    >> modulating indicator called aytham, yet an unacceptably high number of
    >> code points allocated for Tamil are deprecated and made unusable because
    >> of this transliteration encoding that never works.
    >
    > I'm not sure what you mean here. Are you saying, for example, that U+0BA6
    > is TAMIL LETTER DA but is deprecated? The official position is that
    > U+0BA6 is not assigned and cannot be used at present, but I presume that
    > it is being reserved until such time, if ever, that Tamil (or a language
    > using the Tamil script) readmits Grantha DA (in a suitable modern form),
    > and I would hope that any script-sensitive renderer would support such an
    > encoding without having to be upgraded if the assignment were ever made.
    >
    > I wonder if similar holes in the Lao block are handled like this.
    >
    > Richard.
    >
    >
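
    The re-ordering of consonant weights discussed in the quoted message can be
    illustrated with a small sketch. This is not the Unicode Collation
    Algorithm and does not read allkeys.txt. It only contrasts binary code
    point order with a hand-written rank table for single letters. The order
    used below (the 18 traditional consonants followed by the Grantha-derived
    letters) is an assumption made for illustration. An actual proposal would
    state its own order and express it as UCA weights. The names TRADITIONAL,
    GRANTHA, RANK and tailored_key are hypothetical.

```python
# ASSUMPTION: illustrative target order only, given as code points.
# The 18 traditional consonants, from U+0B95 KA to U+0BA9 NNNA.
TRADITIONAL = [
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,
]
# Grantha-derived letters, placed after the traditional ones (assumption).
GRANTHA = [0x0B9C, 0x0BB6, 0x0BB7, 0x0BB8, 0x0BB9]

TAILORED = [chr(cp) for cp in TRADITIONAL + GRANTHA]
RANK = {ch: i for i, ch in enumerate(TAILORED)}


def tailored_key(text):
    """Sort key: rank from the table; unknown characters sort last by code point."""
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in text]


by_code_point = sorted(TAILORED)                   # binary (code point) order
by_tailoring = sorted(TAILORED, key=tailored_key)  # order given by the table

print(" ".join(by_code_point))
print(" ".join(by_tailoring))
```

    In code point order U+0BA9 sorts before U+0BAA, while the table above
    places it last among the traditional consonants. A difference of that kind
    is what a re-ordering of weights would have to capture.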


