Re: Tamil Collation vs Transliteration/Transcription Enc Version2

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 26 2005 - 06:05:31 CDT

     Sinnathurai Srivas wrote:

    >>> For example Tamil K will indicate k, h, g, q, x and other related
    >>> phoneme
    >>> while Devanagari would have individual character shapes representing
    >>> individual phonemes. Tamil is based on Alphabet based phonemic system,
    >>> while Devanagari is based on phonemic system.

    >> I think you mean that Tamil spelling uses digraphs for consonants while
    >> Devanagari uses single letters. Unless the Tamil digraphs are sorted like
    >> single letters, this happens to be irrelevant for Unicode.

    > No if by digraphs, you mean
    > http://www.deltatranslator.com/delta/diagraphs.htm.

    Do you just mean then that Tamil orthography is ambiguous?

    > each alphabet represent some related phonemes.

    Vocabulary note: Unlike in Indian-based languages, 'alphabet' in English
    means the whole system, not an individual _letter_.

    >>> If Unicode changes its policy from the unimportant and non functioning
    >>> transliteration based encoding to one of natural sorting based encoding
    >>> would be a superior solution. However, expecting Unicode to change its
    >>> encoding philosophy of ISCII based transliteration encoding to one of
    >>> natural sorting based encoding is not going to be easy.

    >> You may care to view the UCA weights as a temporary conversion to a
    >> sorting-based encoding.

    > Can you give some pointers.

    I hope you have read the Unicode Collation Algorithm (
    http://www.unicode.org/reports/tr10/ ). It proceeds in four main steps
    (Section 4):

    Step 1: Convert to Normalization Form D (NFD) - probably not needed for
    Tamil - see Section 7.2 of UCA.
    Step 2: Look up the sequence of 'weights'.
    Step 3: Form the sort key from the weights.
    (Step 4: Use the sort keys like any other sorting algorithm.)

    The 'Level 1' part of the weights is what I was suggesting be thought of as
    a sorting-based encoding. For example, consider the ASCII characters 'B',
    'C', 'b' and 'c' and the latest set of weights (in
    http://www.unicode.org/Public/UCA/latest/allkeys.txt )

    'b' U+0062 Level 1 weight 0F85 Level 2 weight 0020 Level 3 weight 0002
    'B' U+0042 Level 1 weight 0F85 Level 2 weight 0020 Level 3 weight 0008
    'c' U+0063 Level 1 weight 0F9D Level 2 weight 0020 Level 3 weight 0002
    'C' U+0043 Level 1 weight 0F9D Level 2 weight 0020 Level 3 weight 0008

    The combination of weights is chosen so that 'b' and 'B' both come before
    'c' and 'C', even though their binary Unicode encodings would give the order
    'B', 'C', 'b', 'c'. The Level 3 weights differ so that although 'bc' comes
    before 'Bc', 'Bb' comes before 'bc'. This is a complication that does not
    exist in Tamil.
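
    The ordering described above can be checked with a small sketch. It is
    not the full UCA: the table holds only the four entries quoted above, and
    the key layout (all Level 1 weights, then Level 2, then Level 3, with a
    zero separator) is a simplification. The names WEIGHTS and sort_key are
    invented for this illustration.

```python
# Toy weight table copied from the four entries quoted above, as
# (Level 1, Level 2, Level 3). A real table would come from allkeys.txt.
WEIGHTS = {
    "b": (0x0F85, 0x0020, 0x0002),
    "B": (0x0F85, 0x0020, 0x0008),
    "c": (0x0F9D, 0x0020, 0x0002),
    "C": (0x0F9D, 0x0020, 0x0008),
}


def sort_key(word):
    """Simplified UCA-style key: Level 1 weights, then Level 2, then Level 3."""
    elements = [WEIGHTS[ch] for ch in word]
    key = []
    for level in range(3):
        key.extend(element[level] for element in elements)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)


words = ["bc", "Bc", "Bb", "C", "c", "B", "b"]

# Plain code point order puts all capitals first.
by_code_point = sorted(words)

# Weight order groups 'b'/'B' before 'c'/'C'; case only breaks ties.
by_weights = sorted(words, key=sort_key)

print(by_code_point)
print(by_weights)
```

    Because every Level 1 weight is compared before any Level 3 weight, case
    only matters when the letters themselves are the same, which is why 'Bb'
    sorts before 'bc' while 'bc' sorts before 'Bc'.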

    >>> We will need to work on what is imposed on Tamil and find software
    >>> solutions to resolve sorting requirements.

    >> If Tamil sorting can be expressed purely by a sorting order of consonants
    >> and vowels, then the answer for sorting words is simply to rearrange the
    >> weights on vowels and letters in the default UCA to accord with this
    >> ordering.

    > 99% yes.

    > Simply, the pulli (virama!), the dependent vowels, vowels and Aytham need
    > to be weighted and that's it.
    > However, by Grammar, because of puLLi/virama there should not be conjuncts
    > in Tamil. However Unicode has decided Tamil has one conjunct. (Not
    > hundreds but one). Instead of treating the Grantham ksh as x, Unicode
    > insists ksh is a conjunct. There are no other complications. So we may need
    > to spend vast amount of money to fix this insistence by Unicode, does not
    > matter if only one or a thousand Tamil has a conjunct in the form of ksh
    > and if collation need to be implemented as in Tamil design, Tamil need to
    > accept Unicode design and work with it.

    This is not a big problem. In the look-up table of weights, one simply
    inserts an entry for 'ksh' (a 'contraction' - Section 3.1.1.2). See
    discussion of VOWEL SIGN O below.

    > There are double encodings of some phenomena. Unicode violated its own
    > policy of standardising language by double encoding in the name of
    > canonism. This is also violation of Unicode architecture, whereby it
    > violates linear and ligature philosophy by misunderstanding canonism. see
    > http://www.geocities.com/avarangal/rfc/RFC-TA-content_Tamil.html This
    > unwanted inclusion may cut the 99% simple algorithm to about 80% simple
    > plus 20% extremely complicated and back breaking algorithm, that might
    > cause problems for a long time to come.

    The default weights already address this. The current weight entries for
    VOWEL SIGN O and its decomposition are given in the table by:

    0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
    0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O

    Note that the sorting algorithm will treat them as identical.
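
    As a rough check, the two entries quoted above can be parsed and
    compared. This is only a sketch: parse_entry is a hypothetical helper that
    handles just the simple line shape shown here, not every form found in
    allkeys.txt. The last line uses Python's standard unicodedata module to
    confirm that U+0BCA decomposes canonically to U+0BC6 U+0BBE.

```python
import re
import unicodedata

# The two table lines quoted above, verbatim.
ENTRIES = """\
0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
"""


def parse_entry(line):
    """Return (character sequence, list of weight tuples) for one table line."""
    body = line.split("#", 1)[0]            # drop the trailing comment
    chars, elements = body.split(";", 1)    # code points ; collation elements
    text = "".join(chr(int(cp, 16)) for cp in chars.split())
    weights = [
        tuple(int(w, 16) for w in element.split("."))
        for element in re.findall(r"\[[.*]([0-9A-Fa-f.]+)\]", elements)
    ]
    return text, weights


table = dict(parse_entry(line) for line in ENTRIES.splitlines())

precomposed = "\u0BCA"       # TAMIL VOWEL SIGN O
decomposed = "\u0BC6\u0BBE"  # TAMIL VOWEL SIGN E + TAMIL VOWEL SIGN AA

# Both spellings carry the same weights, so they sort identically.
same_weights = table[precomposed] == table[decomposed]

# Step 1 (NFD) would also map the single code point to the two-code-point form.
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_weights, same_after_nfd)
```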

    A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
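
    A sketch of how such a contraction entry could be used at look-up time
    follows. The weights in TABLE are made up for illustration and are not
    taken from allkeys.txt, primary_weights is a hypothetical helper, and the
    simple longest-match loop ignores the finer points of contraction
    matching in the UCA.

```python
# Hypothetical Level 1 weights, invented for this illustration only.
TABLE = {
    "\u0B95": 0x1960,              # TAMIL LETTER KA
    "\u0BB7": 0x1970,              # TAMIL LETTER SSA
    "\u0BCD": 0x1990,              # TAMIL SIGN VIRAMA (pulli)
    "\u0B95\u0BCD\u0BB7": 0x1978,  # contraction: KA + VIRAMA + SSA ('ksh')
}

MAX_LEN = max(len(sequence) for sequence in TABLE)


def primary_weights(text):
    """Look up weights, preferring the longest table entry at each position."""
    weights = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in TABLE:
                weights.append(TABLE[chunk])
                i += n
                break
        else:
            raise KeyError(f"no table entry for U+{ord(text[i]):04X}")
    return weights


# 'ksh' is matched as one unit and receives a single weight.
ksh_weights = primary_weights("\u0B95\u0BCD\u0BB7")

# KA followed directly by SSA (no virama) still gets two separate weights.
ka_ssa_weights = primary_weights("\u0B95\u0BB7")

print(ksh_weights, ka_ssa_weights)
```

    With one extra table entry, 'ksh' sorts as a single letter wherever its
    weight places it, without any change to the encoding itself.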

    I'm not sure these canonical decompositions are breaches of architecture any
    more than other canonical expansions. I can't get worked up about this
    issue because for Thai, for example, only the decomposed form is available.

    Richard.


