Re: minimizing size (was Re: allocation of Georgian letters)

From: James Kass (thunder-bird@earthlink.net)
Date: Sat Feb 09 2008 - 15:14:49 CST

    Doug Ewell asked and answered,

    > Q: Then how can we map text between the current Tamil Unicode encoding model and a more "correct" sequence of units that reflects
    > the way Tamil script users think of their script?
    >
    > A: By using the named sequences provided in http://www.unicode.org/Public/5.1.0/ucd/NamedSequencesProv-5.1.0d1.txt. The use of
    > named sequences is described in UAX #34, "Unicode Named Character Sequences."
    >
    > (Note: the "provisional" named sequences for Tamil will probably need to be upgraded to full approved status before users will
    > take this advice seriously.)

    That's a good answer. Another possible answer is that it isn't necessary
    either to map to or to encode "correct" characters as Tamil script users
    think of them, because Unicode already does this. Here's how:

    Let's consider this: "தொல்காப்பியம்". It can be transliterated to the
    Latin script as "tolkAppiyam". This title can be represented uniquely as
    a Unicode string which can't be confused with any other Unicode string.

    The way தொல்காப்பியம் is stored electronically using Unicode is as
    follows (shown here in binary as 16-bit UTF-16 code units, but without
    any line feeds):

    0000101110100100000010111100101000001011101100100000101111001101
    0000101110010101000010111011111000001011101010100000101111001101
    0000101110101010000010111011111100001011101011110000101110101110
    0000101111001101
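
    For anyone who wants to reproduce the bit string above, here is a
    minimal sketch. It assumes Python 3 and big-endian UTF-16; neither is
    part of the original discussion, and the helper name utf16_bits is
    made up for illustration. The title is spelled out with escapes so
    that the exact code points are visible.

```python
# The title as explicit code points (assumed to match the text above):
# TA, vowel sign O, LA, virama, KA, vowel sign AA, PA, virama,
# PA, vowel sign I, YA, MA, virama
title = (
    "\u0ba4\u0bca\u0bb2\u0bcd\u0b95\u0bbe\u0baa\u0bcd"
    "\u0baa\u0bbf\u0baf\u0bae\u0bcd"
)


def utf16_bits(text):
    """Return the big-endian UTF-16 form of text as a string of 0s and 1s."""
    data = text.encode("utf-16-be")
    return "".join(f"{byte:08b}" for byte in data)


bits = utf16_bits(title)

# Print 64 bits (four 16-bit code units) per line, as in the listing above.
for start in range(0, len(bits), 64):
    print(bits[start:start + 64])
```

    Thirteen code points at 16 bits each come to 208 bits, which is the
    length of the listing above.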

    The TACE/TUNE encodings consider syllables and pure consonants to
    be characters for encoding purposes. If we break this title down into
    TACE/TUNE style encoding, we see that Unicode already offers a
    unique method of storing all of this information electronically.

    Everything TACE/TUNE wants encoded in Unicode is *already* encoded
    in Unicode:

    TACE/TUNE syllable "TO" 00001011101001000000101111001010

    TACE/TUNE pure consonant "L" 00001011101100100000101111001101

    TACE/TUNE syllable "KAA" 00001011100101010000101110111110

    TACE/TUNE pure consonant "P" 00001011101010100000101111001101

    TACE/TUNE syllable "PI" 00001011101010100000101110111111

    TACE/TUNE letter "YA" 0000101110101111

    TACE/TUNE pure consonant "M" 00001011101011100000101111001101
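
    The breakdown above can also be recovered mechanically from the
    stored string. What follows is only a rough sketch, assuming Python 3
    and its standard unicodedata module: it starts a new unit at every
    character that is not a combining mark and attaches vowel signs and
    the virama to the preceding consonant. This is a deliberate
    simplification, not the full grapheme cluster rules of UAX #29, and
    the function name syllable_units is hypothetical.

```python
import unicodedata

# The title as explicit code points (assumed to match the text above).
title = (
    "\u0ba4\u0bca\u0bb2\u0bcd\u0b95\u0bbe\u0baa\u0bcd"
    "\u0baa\u0bbf\u0baf\u0bae\u0bcd"
)


def syllable_units(text):
    """Group each base letter with the combining marks that follow it."""
    units = []
    for ch in text:
        # General categories Mn and Mc cover the Tamil virama and vowel signs.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


units = syllable_units(title)

# Show each unit as its sequence of code points.
for unit in units:
    print(" ".join(f"U+{ord(c):04X}" for c in unit))
```

    For this title the grouping gives seven units, matching the seven
    entries listed above.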

    So, it all depends on the desired level of granularity. Even if someone
    were to propose that every word in all Tamil dictionaries be given a
    unique sequence to ease text processing, the answer is that Unicode
    already does this.

    Best regards,

    James Kass


