Re: minimizing size (was Re: allocation of Georgian letters)

From: James Kass ([email protected])
Date: Sat Feb 09 2008 - 15:14:49 CST

Next message: James Kass: "Re: minimizing size (was Re: allocation of Georgian letters)"

Previous message: Doug Ewell: "Re: minimizing size (was Re: allocation of Georgian letters)"
In reply to: Doug Ewell: "Re: minimizing size (was Re: allocation of Georgian letters)"
Next in thread: Eric Muller: "Re: minimizing size (was Re: allocation of Georgian letters)"

Doug Ewell asked and answered,

> Q: Then how can we map text between the current Tamil Unicode encoding model and a more "correct" sequence of units that reflects
> the way Tamil script users think of their script?
>
> A: By using the named sequences provided in http://www.unicode.org/Public/5.1.0/ucd/NamedSequencesProv-5.1.0d1.txt. The use of
> named sequences is described in UAX #34, "Unicode Named Character Sequences."
>
> (Note: the "provisional" named sequences for Tamil will probably need to be upgraded to full approved status before users will
> take this advice seriously.)

That's a good answer. Another possible answer is that there is no need
either to map to or to encode "correct" characters matching the way Tamil
script users think of their script, because Unicode already does this.
Here's how:

Consider the title "தொல்காப்பியம்", which can be transliterated into the
Latin script as "tolkAppiyam". This title can be represented as a Unicode
string which can't be confused with any other Unicode string.

Stored electronically using Unicode, தொல்காப்பியம் is the following
sequence of thirteen 16-bit code units (the line breaks shown here are
not part of the data):

0000101110100100000010111100101000001011101100100000101111001101
0000101110010101000010111011111000001011101010100000101111001101
0000101110101010000010111011111100001011101011110000101110101110
0000101111001101
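
As a quick check of the bit string above, here is a minimal Python sketch.
It assumes the bits are big-endian UTF-16 code units (the 16-bit grouping
suggests this, though no encoding form is named above), and the helper
name utf16_bits is made up for illustration.

```python
# The title, written with escapes so the code points are unambiguous:
# U+0BA4 U+0BCA U+0BB2 U+0BCD U+0B95 U+0BBE U+0BAA U+0BCD
# U+0BAA U+0BBF U+0BAF U+0BAE U+0BCD
title = (
    "\u0ba4\u0bca\u0bb2\u0bcd\u0b95\u0bbe\u0baa\u0bcd"
    "\u0baa\u0bbf\u0baf\u0bae\u0bcd"
)


def utf16_bits(text):
    """Return one 16-character bit string per UTF-16 code unit (big-endian)."""
    data = text.encode("utf-16-be")
    return [
        format(int.from_bytes(data[i:i + 2], "big"), "016b")
        for i in range(0, len(data), 2)
    ]


bits = utf16_bits(title)

# Print four code units per line, matching the layout used in the message.
for i in range(0, len(bits), 4):
    print("".join(bits[i:i + 4]))
```

Run as written, this prints the same four lines of bits as above: thirteen
code units, one per Tamil code point in the title.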

The TACE/TUNE encodings consider syllables and pure consonants to
be characters for encoding purposes. If we break this title down into
TACE/TUNE style encoding, we see that Unicode already offers a
unique method of storing all of this information electronically.

Everything TACE/TUNE wants encoded in Unicode is *already* encoded
in Unicode:

TACE/TUNE syllable "TO" 00001011101001000000101111001010

TACE/TUNE pure consonant "L" 00001011101100100000101111001101

TACE/TUNE syllable "KAA" 00001011100101010000101110111110

TACE/TUNE pure consonant "P" 00001011101010100000101111001101

TACE/TUNE syllable "PI" 00001011101010100000101110111111

TACE/TUNE letter "YA" 0000101110101111

TACE/TUNE pure consonant "M" 00001011101011100000101111001101
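
The breakdown above can be reproduced mechanically. The sketch below is
only an approximation and not the TACE/TUNE definition of a unit: it
assumes that each unit is a base letter followed by any combining marks
(vowel signs or the pulli), which happens to be enough for this title
but would not handle sequences such as conjuncts. The function name
split_units is hypothetical.

```python
import unicodedata

title = (
    "\u0ba4\u0bca\u0bb2\u0bcd\u0b95\u0bbe\u0baa\u0bcd"
    "\u0baa\u0bbf\u0baf\u0bae\u0bcd"
)  # the title discussed above


def split_units(text):
    """Group each base letter with the combining marks that follow it.

    Assumption: any code point whose general category starts with "M"
    (Mn, Mc, Me) attaches to the preceding base letter.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


units = split_units(title)
for unit in units:
    # Show each unit with its code points in U+XXXX notation.
    print(unit, " ".join(f"U+{ord(c):04X}" for c in unit))
```

For this title the grouping yields seven units, corresponding to TO, L,
KAA, P, PI, YA and M, each of which is a distinct sequence of existing
Unicode code points.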

So, it all depends on the desired level of granularity. Even if someone
were to propose that every word in all Tamil dictionaries be given a
unique sequence to ease text processing, the answer is that Unicode
already does this.

Best regards,

James Kass

