Re: Data compression

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat May 07 2005 - 20:37:42 CDT

    At 10:36 AM 5/7/2005, Doug Ewell wrote:
    >It is possible to build an "OK" compressor or a "really good" compressor
    >within the same spec. This is also true for some types of non-text
    >compression.

    Unlike with other compression schemes, the performance of a truly
    sophisticated compressor and that of a basic compressor are often very
    close here, unless the basic compressor is written to be intentionally
    'stupid'.

    The main reason for allowing different compressors was to make sure that
    certain types of strings, for example in Japanese, could utilize compressors
    that were optimized for the particular mix of scripts found in that language.
    For a scenario that's exclusively Japanese, the 10% (or so) improvement that
    a more optimized compressor might yield could be important.

    The UTS comes with two sets of sample code, one in Java and one in C. The
    former implements a middle-of-the-road compression strategy, where some
    optimization is attempted while the complexity of the code is kept
    reasonable; the latter presents a very minimal, yet useful, encoder.

    See http://www.unicode.org/Public/PROGRAMS/ for the source code.

    The difference in the performance of these two encoders would probably
    not matter, except for really high-volume usage for certain types of
    strings or languages.

    The main reason is that the fundamental compression model is the same;
    the difference is in the lookahead and the use of some optional
    features. This is similar to the task of optimizing program code for
    speed: tweaking the code tends to yield improvements of a few percent
    here or there, while a change to the fundamental algorithm is what really
    improves things.
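
    To make the "same model, different lookahead" point concrete, here is a
    rough sketch of a no-lookahead encoder. It assumes the UTS in question
    is UTS #6 (SCSU) and uses that scheme's single-byte mode, its default
    dynamic window offsets, and its SQ0, SQU and SCn tags. The function name
    encode_basic and the constant names are made up for illustration; this
    is not the Java or C sample code, and it only handles BMP characters.

```python
# Hypothetical minimal SCSU-style encoder (illustrative names, BMP only).
SQ0 = 0x01  # quote one character from static window 0 (U+0000..U+007F)
SQU = 0x0E  # quote one UTF-16 code unit: SQU, high byte, low byte
SC0 = 0x10  # SC0..SC7 (0x10..0x17): make dynamic window n the active one

# Default start offsets of the eight dynamic windows; each spans 128 code points.
WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
# Control characters that pass through unchanged in single-byte mode.
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}


def encode_basic(text: str) -> bytes:
    """Encode text with no lookahead, using only the predefined windows."""
    out = bytearray()
    active = 0  # dynamic window 0 is active at the start of a stream
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("this sketch handles BMP characters only")
        if 0x20 <= cp <= 0x7F or cp in PASS_THROUGH:
            out.append(cp)  # ASCII passes through as a single byte
        elif cp < 0x20:
            out += bytes([SQ0, cp])  # other controls must be quoted
        elif WINDOWS[active] <= cp < WINDOWS[active] + 0x80:
            out.append(0x80 + cp - WINDOWS[active])  # one byte via active window
        else:
            for i, base in enumerate(WINDOWS):
                if base <= cp < base + 0x80:
                    active = i  # switch window, then emit the character
                    out += bytes([SC0 + i, 0x80 + cp - base])
                    break
            else:
                # No predefined window covers it: quote it, three bytes total.
                out += bytes([SQU, cp >> 8, cp & 0xFF])
    return bytes(out)
```

    Even this encoder gets Cyrillic or Latin-1 text down to about one byte
    per character. What it handles poorly is, for instance, a run of kanji,
    where every character costs three bytes; an encoder with lookahead could
    presumably switch to Unicode mode for the whole run, or define its own
    windows, which is the sort of gain described above for Japanese.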

    A./


