Re: Tamil Collation vs Transliteration/Transcription Enc

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jun 24 2005 - 17:52:19 CDT


    In reply to Sinnathurai Srivas, I'll deal with the least antagonising point
    first:

    > It is very important that we start work on fixing the bug caused by
    > transliteration based encoding to do the collation as required. We will
    > analyse the collation techniques available to fix the problem caused by
    > transliteration based encoding bug.

    Have you read and understood the Unicode Collation Algorithm (
    http://www.unicode.org/reports/tr10/ )? What you most importantly need to
    propose is a re-ordering of the weights (basically
    195C.0020.0002 to 1972.0020.0002) assigned to the Tamil consonants (U+0B95
    to U+0BB9) in http://www.unicode.org/Public/UCA/latest/allkeys.txt
    (currently Version 4.1.0). If you can demonstrate that your proposed
    weights give the correct order, I don't see why the change shouldn't be
    accepted. If you can fix any other collation 'errors' at the same time, I
    think so much the better.
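
    As a rough illustration of what such a re-ordering amounts to, here is a
    minimal Python sketch. It is not the UCA: it builds a single-level key
    from a rank table, where a real proposal would change the primary weights
    in allkeys.txt. The names TRADITIONAL, RANK and tamil_key are made up for
    the example, and the table assumes the usual order of the 18 native Tamil
    consonants (ka ... va, then llla, lla, rra, nnna) is the order wanted.

```python
# Assumed traditional order of the 18 native Tamil consonants, by code point.
TRADITIONAL = [
    0x0B95,  # KA
    0x0B99,  # NGA
    0x0B9A,  # CA
    0x0B9E,  # NYA
    0x0B9F,  # TTA
    0x0BA3,  # NNA
    0x0BA4,  # TA
    0x0BA8,  # NA
    0x0BAA,  # PA
    0x0BAE,  # MA
    0x0BAF,  # YA
    0x0BB0,  # RA
    0x0BB2,  # LA
    0x0BB5,  # VA
    0x0BB4,  # LLLA
    0x0BB3,  # LLA
    0x0BB1,  # RRA
    0x0BA9,  # NNNA
]

# Hypothetical rank table standing in for re-ordered primary weights.
RANK = {chr(cp): i for i, cp in enumerate(TRADITIONAL)}


def tamil_key(text):
    """Single-level sort key: ranked consonants first, then code point order."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in text]


letters = [chr(cp) for cp in (0x0BAA, 0x0BA9, 0x0BB5, 0x0BB1)]  # PA NNNA VA RRA
by_code_point = sorted(letters)
by_tradition = sorted(letters, key=tamil_key)
print([hex(ord(ch)) for ch in by_code_point])
print([hex(ord(ch)) for ch in by_tradition])
```

    In code point order NNNA (U+0BA9) sorts before PA (U+0BAA); with the rank
    table it moves to the end, after RRA. Vowel signs, pulli and the Grantha
    letters are left in code point order here and would need weights of their
    own in a real proposal.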

    There is no explicit undertaking that the default Unicode Collation
    Algorithm is correct for any language, but I am not aware of any reason that
    it would be wrong to make it work properly for the collation of items in the
    Tamil script. I can't believe collating Sanskrit correctly in the Tamil
    script is an important consideration.

    It might be worth advising what to do if any more Grantha letters are
    restored to the Tamil block, e.g. as SHA U+0BB6 was at 4.1.0. I for one
    would not be surprised if there were antiquarian fonts that added the entire
    Grantha script to Tamil, using the obvious code points. [Would Uniscribe
    provide OpenType support for such an 'extended-Unicode' (i.e. strictly
    speaking, not Unicode) font? Does FreeType?]

    > Though it undergoes numerous implementation problems, Unicode is based on
    > a highly sophisticated technical architecture. In this article how Unicode
    > mishandled Tamil collation and analyses the alternative solutions to
    > attain Tamil Collation.

    > Any implementation would initially attempt for a natural sort order for a
    > language, where by the default hex order of codes would be a natural sort
    > order of that language. The question now is why Unicode decided to deny
    > this natural facility to Tamil, in its implementation strategy. The answer
    > is, in Unicode's consideration there is another requirement that was
    > considered more important than sorting order of Tamil. The requirement
    > was, the transliteration properties of code order of all Indian languages
    > must be the same and sort order was considered a minute matter in
    > comparison to sort order. Unicode decided that writing softwares to
    > transliterate between different Indic languages is a more daunting task
    > than writing software to collate a language.

    > Unlike Latin based languages, each Indic languages use alphabet of their
    > own. For this reason abandoning natural sort order in favour of
    > transliteration sort order was not a technical but a political decision by
    > Unicode. <snip> Software routines to do transliteration is a simple task,
    > compared to software routines to collate a scrambled encoding.

    You could argue that the Indian ones are almost treated as though they
    shared a common alphabet. They therefore suffer the same way as
    Latin-script languages do - the collation of letters beyond the basic 26 is
    similarly messy, and life gets even more complicated with languages that
    insist on treating digraphs as independent letters (e.g. at least CH, PH,
    TH, LL, NG and DD in Welsh). All Latin-script languages suffer from the
    fact that 'B' comes before 'a' in binary order, while for human use they are
    much better sorted the other way round.
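
    Both Latin-script problems are easy to demonstrate. The Python sketch
    below is only an illustration: WELSH_ALPHABET, RANK and welsh_key are
    names invented here, and the greedy pairing of digraphs is a known
    simplification (it cannot tell a true NG from N followed by G).

```python
# Binary (code point) order puts every capital before every small letter.
binary_order = sorted(["a", "B"])
folded_order = sorted(["a", "B"], key=str.casefold)

# The Welsh alphabet, with digraphs counted as single letters.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i",
    "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th",
    "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}


def welsh_key(word):
    """Sort key that greedily treats known digraphs as one letter."""
    s = word.casefold()
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the table sort after all Welsh letters.
            key.append(RANK.get(s[i], len(RANK) + ord(s[i])))
            i += 1
    return key


words = ["chwech", "cyfan", "llaw", "lwc"]
plain = sorted(words)
welsh = sorted(words, key=welsh_key)
print(binary_order, folded_order)
print(plain)
print(welsh)
```

    With the digraph-aware key, "cyfan" sorts before "chwech" and "lwc"
    before "llaw", the reverse of what plain code point order gives.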

    > Unicode transliteration scheme does not work. The saddest thing of all is
    > that the transliteration does not work as Unicode hoped it. There never
    > was a simple transliteration mechanism suitable for encoding different
    > languages. For example, Tamil writing system is based on phonemic based
    > Alphabet system, while Devanagari is based on phonemic only system. In
    > Tamil k = k, h, g, x, q, c (mahaL, magan, makkan, quil, xavier, etc..).

    Surely the point here is that one attempts to write modern Tamil as though
    it had the same phonology as Classical Tamil. Is it safe to claim aspirated
    (as opposed to fricative) phonemes for Tamil? Claiming voicing contrasts
    brings down a stream of invective.

    > In Devanagari individual glyph shapes represent each of these phonemes. In
    > Tamil aspirated and many other sounds are written using a single
    > modulating indicator called Aytham, yet an unacceptably high number of
    > code points allocated for Tamil is deprecated and made unusable because of
    > this transliteration encoding that never works.

    I'm not sure what you mean here. Are you saying, for example, that U+0BA6
    is TAMIL LETTER DA but is deprecated? The official position is that U+0BA6
    is not assigned and cannot be used at present, but I presume that it is
    being reserved until such time, if ever, that Tamil (or a language using the
    Tamil script) readmits Grantha DA (in a suitable modern form), and I would
    hope that any script-sensitive renderer would support such an encoding
    without having to be upgraded if the assignment were ever made.
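
    The status of such a hole can be checked against whatever version of the
    Unicode Character Database a given platform ships. A small sketch using
    Python's standard unicodedata module follows; the helper name describe is
    invented for the example, and the result depends on the Unicode version
    bundled with the interpreter.

```python
import unicodedata


def describe(cp):
    """Return (general category, character name) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch, "<unassigned>")


hole = describe(0x0BA6)  # the gap where a Grantha DA would sit
sha = describe(0x0BB6)   # SHA, added at 4.1.0

print("U+0BA6:", hole)
print("U+0BB6:", sha)
```

    An unassigned code point reports general category Cn and has no name; an
    assigned letter such as SHA reports Lo.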

    I wonder if similar holes in the Lao block are handled like this.

    Richard.


