From: N. Ganesan (firstname.lastname@example.org)
Date: Fri May 06 2005 - 12:47:05 CDT
Thanks for all the interesting and useful tech comments.
>Tamil compresses very well for example with SCSU (with nearly one encoded
>byte per codepoint).
I'm a mere structural dynamicist who collects and edits classical Tamil texts.
Can you tell us a little more about SCSU? Any pointers or URLs on
how it works on texts, say Tamil Unicode text? Tamil letters
do not form conjuncts; in this sense the script is similar to the Latin script of Europe.
The only conjunct-like forms are the u/uu vowel series acting upon consonants
(Table 9-10, Section 9.6, The Unicode Standard 4.0).
Do these conjuncts in Table 9-10 also take two code points each?
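If it helps to make the question concrete, here is how such a combination is stored in Unicode (the particular syllable is my own illustration, not one taken from Table 9-10):

```python
# A Tamil consonant + u vowel sign is stored as two code points:
# the base consonant followed by the combining vowel sign U+0BC1.
ku = "\u0B95\u0BC1"   # க (KA) + ு (vowel sign U) -> கு
print(len(ku), [hex(ord(c)) for c in ku])
# 2 ['0xb95', '0xbc1']
```

So the rendered shape is one glyph, but the backing store is two code points.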
It will be of interest if Arvind Thiagarajan can find a solution
giving something like 50:1 data compression in MPEG formats,
but I don't know more about his product.
Hans Aberg wrote:
>If one can make 30 times loss-less data compression without being
>able to search, that would still be interesting for archiving and
>backup systems. One should note though that the success of data
>compression depends much on the types of data one is compressing, and
>how intelligent the software is in finding regularities.
Journal articles can be retrieved at home through college campus databases
such as JSTOR; the papers come as PDF files. Since Tamil letters are
few compared to, say, Devanagari, and there are basically no conjuncts (unlike
North Indian scripts), could an OCR program coupled with
SCSU turn paper books into Unicode text in Tamil?
With only 50 or so glyphs, Tamil has a lot of patterns
that can be usefully exploited. Maybe
vendors can test OCR+SCSU in Tamil first, as this must be far simpler
than Devanagari because of the orthography.
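As a rough sketch of why SCSU gets near one byte per code point for Tamil (this is my own simplified illustration, not a full SCSU codec: it assumes input limited to ASCII plus the Tamil block U+0B80-U+0BFF, whereas real SCSU switches among windows and a Unicode mode; the window offset index 0x17 is taken from the SCSU window table, where offset = index * 0x80 = U+0B80):

```python
# Minimal illustrative sketch of SCSU's single-byte mode for Tamil.
# Not a full codec: only ASCII + the Tamil block are handled.

TAMIL_BASE = 0x0B80  # start of the Tamil block

def scsu_encode_tamil(text: str) -> bytes:
    # SD0 tag (0x18) redefines dynamic window 0; offset index 0x17
    # selects offset 0x17 * 0x80 = U+0B80 per the SCSU window table.
    out = bytearray([0x18, 0x17])
    for ch in text:
        cp = ord(ch)
        if cp in (0x09, 0x0A, 0x0D) or 0x20 <= cp < 0x7F:
            out.append(cp)  # ASCII passes through in single-byte mode
        elif TAMIL_BASE <= cp < TAMIL_BASE + 0x80:
            out.append(0x80 + (cp - TAMIL_BASE))  # one byte via the window
        else:
            raise ValueError("outside this sketch's ASCII/Tamil subset")
    return bytes(out)

sample = "தமிழ்"  # 5 code points
encoded = scsu_encode_tamil(sample)
print(len(sample), len(encoded), len(sample.encode("utf-8")))
# 5 code points -> 7 SCSU bytes (2-byte window setup + 5) vs 15 UTF-8 bytes
```

After the two-byte window setup, every Tamil code point costs one byte, compared with three bytes each in UTF-8, which is where the "nearly one byte per code point" figure comes from.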
This archive was generated by hypermail 2.1.5 : Fri May 06 2005 - 12:48:17 CDT