Re: Tamil Collation vs Transliteration/Transcription Enc Version2

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Sun Jun 26 2005 - 15:38:27 CDT


    Why punish Tamil for mistakes in Grantham and Unicode?

    > 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    > 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >
    > Note that the sorting algorithm will treat them as identical.
    >
    > A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.

    Tamil can be processed at 16 bits (and 8 bits).

    Why this unnecessary punishment? Let's compute Tamil at 16 bits. Unicode
    breaks its own architecture with canonism that is not an accurate
    description, and forces Tamil to go the 32-bit way.

    Tamil can be processed at 16 bits (and 8 bits).

    Why this punishment by Grantham? ksh forces Tamil to go even the 48-bit
    way.
    Please find ways to stop this nonsense.

    Tamil does not need all this unwanted punishment. We are innocent, please.

    Let's do 16-bit processing. Let's stop untechnical canonism.
    Let's stop the vastly complex ksh wreaking havoc on Tamil.

    There is only one conjunct, if Unicode insists that it is a conjunct and
    not x.
    Why punish Tamil? Tamil grammar specifically avoids conjuncts by
    simplifying the mating mechanism with pulli. (Unicode must read and
    understand pulli and Tamil grammar before it declares that all Indic
    scripts come from the same mud.)

    Tamil grammar was written long ago. During transit some might have lost the
    root. Why assume it is all the same mud without having read about it?

    Srivas

    ----- Original Message -----
    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    To: <unicode@unicode.org>
    Sent: Sunday, June 26, 2005 12:05 PM
    Subject: Re: Tamil Collation vs Transliteration/Transcription Enc Version2

    > Sinnathurai Srivas wrote:
    >
    >>>> For example, Tamil K will indicate k, h, g, q, x and other related
    >>>> phonemes, while Devanagari would have individual character shapes
    >>>> representing individual phonemes. Tamil is based on an alphabet-based
    >>>> phonemic system, while Devanagari is based on a phonemic system.
    >
    >>> I think you mean that Tamil spelling uses digraphs for consonants while
    >>> Devanagari uses single letters. Unless the Tamil digraphs are sorted
    >>> like single letters, this happens to be irrelevant for Unicode.
    >
    >> No, if by digraphs you mean
    >> http://www.deltatranslator.com/delta/diagraphs.htm.
    >
    > Do you just mean then that Tamil orthography is ambiguous?
    >
    Grammar defines the phonemes most of the time.

    As for transliterating foreign words, it is somewhat ambiguous.
    But it is being investigated, and Tamils have not felt any extra urge to
    find a solution. It does not mean Tamil should be replaced by Sanskrit.

    Imagine: because English cannot write India, it writes India. It fools the
    whole world, and the Indians, that it is India. Is it the duty of Unicode
    to fix this problem in English?

    Sanskrit cannot write some Tamil words, such as R Rom (not r) or T as in
    Tom.
    It cannot write Inthia, because it has no nth. Hindi cannot write hindi,
    because it has no nth. They still survive. Why is Tamil being forced by
    powerful others to change?

    >> each alphabet represents some related phonemes.
    >
    > Vocabulary note: Unlike in Indian-based languages, 'alphabet' means the
    > whole system, not an individual _letter_.
    >
    Tamil uses an alphabet-based phonemic system. Each character in Tamil .... ....

    >>>> If Unicode changed its policy from the unimportant and non-functioning
    >>>> transliteration-based encoding to a natural-sorting-based encoding,
    >>>> that would be a superior solution. However, expecting Unicode to change
    >>>> its encoding philosophy from ISCII-based transliteration encoding to
    >>>> natural-sorting-based encoding is not going to be easy.
    >
    >>> You may care to view the UCA weights as a temporary conversion to a
    >>> sorting-based encoding.
    >
    >> Can you give some pointers?
    >
    > I hope you have read the Unicode Collation Algorithm (
    > http://www.unicode.org/reports/tr10/ ). It proceeds in four main steps
    > (Section 4)
    >
    > Step 1: Convert to Normal Form Decomposed (NFD) - probably not needed for
    > Tamil - See Section 7.2 of UCA.
    > Step 2: Look up the sequence of 'weights'.
    > Step 3: Form the sort key from the weights.
    > (Step 4: Use the sort keys like any other sorting algorithm.)
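
A minimal sketch of what Step 1 does, using Python's standard `unicodedata` module. It only illustrates the normalization step, not the collation algorithm itself; the choice of KA plus VOWEL SIGN O as the sample string is an assumption made for this illustration.

```python
import unicodedata

# TAMIL LETTER KA followed by the precomposed TAMIL VOWEL SIGN O (U+0BCA).
composed = "\u0b95\u0bca"

# Step 1: convert the input to Normalization Form D.
decomposed = unicodedata.normalize("NFD", composed)

# U+0BCA canonically decomposes into U+0BC6 (VOWEL SIGN E)
# followed by U+0BBE (VOWEL SIGN AA); KA is left unchanged.
code_points = [f"{ord(ch):04X}" for ch in decomposed]
print(code_points)
```

After normalization both spellings of the vowel sign are the same code point sequence, so the later steps see only one form.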
    >
    > The 'Level 1' part of the weights is what I was suggesting be thought of
    > as a sorting-based encoding. For example, consider the ASCII characters
    > 'B', 'C', 'b' and 'c' and the latest set of weights (in
    > http://www.unicode.org/Public/UCA/latest/allkeys.txt )
    >
    > Char  Code point  Level 1  Level 2  Level 3
    > 'b'   U+0062      0F85     0020     0002
    > 'B'   U+0042      0F85     0020     0008
    > 'c'   U+0063      0F9D     0020     0002
    > 'C'   U+0043      0F9D     0020     0008
    >
    > The combination of weights is chosen so that 'b' and 'B' both come before
    > 'c' and 'C', even though their binary Unicode encodings would give the
    > order 'B', 'C', 'b', 'c'. The Level 3 weights differ so that although
    > 'bc' comes before 'Bc', 'Bb' comes before 'bc'. This is a complication
    > that does not exist in Tamil.
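
The ordering described above can be checked with a small Python sketch of Steps 2 to 4. The four weight triples are the ones quoted in the message; the `sort_key` helper and the sample word list are hypothetical names introduced here, and the sketch ignores everything else the real algorithm handles (normalization, contractions, variable weighting).

```python
# Weights copied from the message: (level 1, level 2, level 3).
WEIGHTS = {
    "b": (0x0F85, 0x0020, 0x0002),
    "B": (0x0F85, 0x0020, 0x0008),
    "c": (0x0F9D, 0x0020, 0x0002),
    "C": (0x0F9D, 0x0020, 0x0008),
}


def sort_key(word):
    """Build a sort key: all level 1 weights, then level 2, then level 3."""
    elements = [WEIGHTS[ch] for ch in word]  # Step 2: look up the weights
    key = []
    for level in range(3):                   # Step 3: form the sort key
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(element[level] for element in elements)
    return tuple(key)


words = ["bc", "Bc", "Bb", "C", "b", "B", "c"]

binary_order = sorted(words)              # plain code point comparison
uca_order = sorted(words, key=sort_key)   # Step 4: compare the sort keys

print(binary_order)
print(uca_order)
```

Because every level 1 weight is compared before any level 3 weight, case only matters between words whose letters are otherwise identical.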
    >
    >>>> We will need to work on what is imposed on Tamil and find software
    >>>> solutions to resolve sorting requirements.
    >
    >>> If Tamil sorting can be expressed purely by a sorting order of
    >>> consonants and vowels, then the answer for sorting words is simply to
    >>> rearrange the weights on vowels and letters in the default UCA to
    >>> accord with this ordering.
    >
    >> 99% yes.
    >
    >> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham need
    >> to be weighted and that's it.
    >> However, by grammar, because of puLLi/virama there should not be
    >> conjuncts in Tamil. However, Unicode has decided Tamil has one conjunct.
    >> (Not hundreds but one.) Instead of treating the Grantham ksh as x,
    >> Unicode insists ksh is a conjunct. There are no other complications. So
    >> we may need to spend a vast amount of money to deal with this insistence
    >> by Unicode. It does not matter whether there is only one conjunct or a
    >> thousand: Tamil has a conjunct in the form of ksh, and if collation is
    >> to be implemented as in the Tamil design, Tamil needs to accept the
    >> Unicode design and work with it.
    >
    > This is not a big problem. In the look-up table of weights, one simply
    > inserts an entry for 'ksh' (a 'contraction' - Section 3.1.1.2). See
    > discussion of VOWEL SIGN O below.
    >
    >> There are double encodings of some phenomena. Unicode violated its own
    >> policy of standardising language by double encoding in the name of
    >> canonism. This is also a violation of Unicode architecture, whereby it
    >> violates the linear and ligature philosophy by misunderstanding canonism.
    >> See http://www.geocities.com/avarangal/rfc/RFC-TA-content_Tamil.html This
    >> unwanted inclusion may cut the 99% simple algorithm to about 80% simple
    >> plus 20% extremely complicated and back-breaking algorithm, which might
    >> cause problems for a long time to come.
    >
    > The default weights already address this. The current weight entries for
    > VOWEL SIGN O and its decomposition are given in the table by:
    >
    > 0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    > 0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    >
    > Note that the sorting algorithm will treat them as identical.
    >
    > A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
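
A rough Python sketch of how such multi-code-point entries can be looked up by longest match. The two VOWEL SIGN O entries are copied from the message; every other weight, including the one for 'ksh', is a made-up placeholder, and `collation_elements` is a hypothetical helper. The sketch matches contiguous sequences only, which is simpler than the matching rules in the UCA itself.

```python
# Table keyed by code point sequences.
TABLE = {
    (0x0BCA,): [(0x197B, 0x0020, 0x0002)],         # TAMIL VOWEL SIGN O
    (0x0BC6, 0x0BBE): [(0x197B, 0x0020, 0x0002)],  # its decomposition
    # Placeholder weights below, invented for this illustration only.
    (0x0B95,): [(0x1950, 0x0020, 0x0002)],         # KA
    (0x0BCD,): [(0x1980, 0x0020, 0x0002)],         # pulli (virama)
    (0x0BB7,): [(0x1960, 0x0020, 0x0002)],         # SSA
    (0x0B95, 0x0BCD, 0x0BB7): [(0x1970, 0x0020, 0x0002)],  # 'ksh' entry
}
MAX_LEN = max(len(seq) for seq in TABLE)


def collation_elements(text):
    """Map text to weights, preferring the longest matching table entry."""
    cps = [ord(ch) for ch in text]
    out = []
    i = 0
    while i < len(cps):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(MAX_LEN, len(cps) - i), 0, -1):
            seq = tuple(cps[i:i + n])
            if seq in TABLE:
                out.extend(TABLE[seq])
                i += n
                break
        else:
            raise KeyError(f"no table entry for U+{cps[i]:04X}")
    return out


precomposed = collation_elements("\u0b95\u0bca")        # KA + VOWEL SIGN O
decomposed = collation_elements("\u0b95\u0bc6\u0bbe")   # KA + E + AA
ksh = collation_elements("\u0b95\u0bcd\u0bb7")          # KA + pulli + SSA

print(precomposed == decomposed)
print(len(ksh))
```

With the three-code-point entry present, 'ksh' yields a single collation element and can be given its own position in the order; without it, the same text would fall back to three separate elements.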
    >
    > I'm not sure these canonical decompositions are breaches of architecture
    > any more than other canonical expansions. I can't get worked up about
    > this issue because for Thai, for example, only the decomposed form is
    > available.
    >
    > Richard.
    >
    >


