From: N. Ganesan
Date: Fri May 06 2005 - 12:47:05 CDT

    Thanks for all the interesting and useful tech comments.

    Philippe wrote:
    >Tamil compresses very well for example with SCSU (with nearly one encoded
    >byte per codepoint).
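
    To make the "nearly one encoded byte per codepoint" figure concrete, here
    is a minimal Python sketch of SCSU's single-byte mode, restricted to one
    dynamic window placed over the Tamil block. It is not a full SCSU encoder:
    the helper name scsu_encode_tamil is made up for illustration, it handles
    only Tamil-block characters and printable ASCII (plus tab, LF and CR), and
    it raises an error on anything else. It relies on the SCSU tag SD0 (0x18)
    followed by the offset byte 0x17, which places window 0 at U+0B80.

```python
# Minimal single-window SCSU sketch (illustrative helper, not a full encoder).

TAMIL_BASE = 0x0B80               # start of the Tamil block, U+0B80..U+0BFF
SD0 = 0x18                        # SCSU tag: define dynamic window 0 and select it
PASS_THROUGH = {0x09, 0x0A, 0x0D}  # tab, LF, CR are emitted unchanged


def scsu_encode_tamil(text: str) -> bytes:
    """Encode Tamil-block text (plus printable ASCII) in SCSU single-byte mode."""
    # Window offset byte: 0x0B80 / 0x80 = 0x17, so window 0 covers U+0B80..U+0BFF.
    out = bytearray([SD0, TAMIL_BASE // 0x80])
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E or cp in PASS_THROUGH:
            out.append(cp)                        # ASCII passes through as one byte
        elif TAMIL_BASE <= cp < TAMIL_BASE + 0x80:
            out.append(0x80 + (cp - TAMIL_BASE))  # one byte per Tamil code point
        else:
            raise ValueError(f"U+{cp:04X} is outside what this sketch handles")
    return bytes(out)


sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"      # the word "Tamil" in Tamil, 5 code points
print(len(scsu_encode_tamil(sample)))          # 7  (2 bytes of window setup + 5)
print(len(sample.encode("utf-8")))             # 15
print(len(sample.encode("utf-16-le")))         # 10
```

    The two setup bytes are paid once per text, so for anything longer than a
    few words the cost approaches one byte per code point, against three
    bytes in UTF-8 and two in UTF-16.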

    I'm a mere structural dynamicist who collects and edits classical Tamil texts.

    Can you tell me a little more about SCSU? Any pointers or URLs on
    how it works on text, say Tamil Unicode text? Tamil letters
    do not form conjuncts, which in this sense makes the script similar to the
    Latin script of Europe. The only forms that behave like conjuncts are the
    u/uu series acting upon consonants (Table 9-10, Section 9.6, The Unicode
    Standard 4.0). Do these conjuncts in Table 9-10 also have 2 code points?
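
    On the code point question, one way to check is to look at how a
    consonant + u/uu form is stored. The Python sketch below (the helper
    name describe is made up for illustration) lists the code points behind
    the ku and kuu syllables. Each is stored as the consonant followed by the
    vowel sign, i.e. two code points; the joined shape is produced at
    rendering time.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, character name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


ku = "\u0b95\u0bc1"    # consonant KA + vowel sign U
kuu = "\u0b95\u0bc2"   # consonant KA + vowel sign UU

for syllable in (ku, kuu):
    print(syllable, len(syllable), describe(syllable))
```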

    It would be of interest if Arvind Thiagarajan can find a solution
    giving, say, 50 times data compression in MPEG formats.
    But I don't know anything more about his product.

    Hans Aberg wrote:
    >If one can make 30 times loss-less data compression without being
    >able to search, that would still be interesting for archiving and
    >backup systems. One should note though that the success of data
    >compression depends much on the types of data one is compressing, and
    >how intelligent the software is in finding regularities.

    Journal articles can be retrieved from JSTOR at home by accessing college
    campus databases. The papers come as PDF files. Since Tamil has
    few letters compared to, say, Devanagari, and basically no conjuncts (unlike
    North Indian scripts), would it be possible for an OCR program coupled with
    SCSU to read printed books into Unicode text in Tamil?
    There are only 50 or so glyphs for Tamil, so there are a lot of patterns
    which can be usefully exploited. Maybe
    vendors can test OCR+SCSU in Tamil first, as this must be much simpler
    than Devanagari because of the orthography.

    Naga Ganesan

