RE: UTN #31 and direct compression of code points

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 09 2007 - 15:01:47 CDT


    Doug Ewell wrote:
    > Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
    >
    > > The algorithm given is clearly for compressing UTF-16 data. Look at
    > > the sign test for three byte difference values. (It could be
    > > adjusted/corrected to handle arbitrary codepoint differences.) I
    > > wonder if SCSU would out-perform the algorithm on, say, Shavian.
    >
    > Shavian can be encoded extremely efficiently in SCSU: only one byte per
    > character, plus three bytes of overhead (0B 60 08) at the start of the
    > stream to set up a dynamic window, and another (01) to quote each U+00B7
    > "namer dot." I doubt the simplified LZ method presented in UTN #31 can
    > top this, but of course there's nothing like experimentation.

    No, I doubt that SCSU will outperform an algorithm based on
    Lempel-Ziv, even a simplified version. Remember the effect of detecting
    previous matches: a match encodes a sequence of "arbitrary" length in
    very few bytes (complete words, expressions, or common word prefixes and
    suffixes and the common punctuation around them), so you'll get less than
    one byte per character in typical texts, as you do for Latin.
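
    The comparison is easy to try. The Python sketch below is only an
    illustration: it is not the UTN #31 algorithm, it uses zlib (DEFLATE) as
    a stand-in for "a Lempel-Ziv-based algorithm", and its SCSU encoder is a
    hypothetical minimal one that handles only ASCII, U+00B7 and the Shavian
    block, emitting the 0B 60 08 header and 01 quote described above. The
    names scsu_shavian and report are invented for this example.

```python
import zlib

SDX = 0x0B  # SCSU tag: define an extended dynamic window (supplementary planes)
SQ0 = 0x01  # SCSU tag: quote a single character from window 0

WINDOW = 3        # dynamic window number, as in the quoted bytes 0B 60 08
OFFSET = 0x10400  # start of the 128-code-point window that covers Shavian


def scsu_shavian(text: str) -> bytes:
    """Minimal SCSU single-byte-mode encoder for ASCII, U+00B7 and Shavian.

    Assumption: this covers only the repertoire discussed in the thread and
    is not a general SCSU implementation.
    """
    index = (OFFSET - 0x10000) >> 7  # 13-bit window index, 8 for 0x10400
    out = bytearray([SDX, (WINDOW << 5) | (index >> 8), index & 0xFF])
    for ch in text:
        cp = ord(ch)
        if OFFSET <= cp < OFFSET + 0x80:
            # Inside the active window: one byte in the range 0x80..0xFF.
            out.append(0x80 + cp - OFFSET)
        elif cp == 0xB7:
            # Namer dot: quoted from dynamic window 0 (default offset 0x0080).
            out += bytes([SQ0, 0xB7])
        elif cp in (0x09, 0x0A, 0x0D) or 0x20 <= cp < 0x80:
            # Tab, LF, CR and printable ASCII pass through unchanged.
            out.append(cp)
        else:
            raise ValueError(f"U+{cp:04X} is outside this sketch's repertoire")
    return bytes(out)


def report(text: str) -> dict:
    """Return bytes per code point for each candidate encoding of text."""
    n = len(text)
    candidates = {
        "SCSU": scsu_shavian(text),
        "zlib(UTF-16LE)": zlib.compress(text.encode("utf-16-le"), 9),
        "zlib(UTF-8)": zlib.compress(text.encode("utf-8"), 9),
    }
    return {name: len(data) / n for name, data in candidates.items()}


if __name__ == "__main__":
    # Placeholder letters, not real Shavian text; substitute a real corpus.
    sample = "\u00b7\U00010456\U00010471\U0001045d \U00010451\U00010469 " * 200
    for name, ratio in report(sample).items():
        print(f"{name:15s} {ratio:.3f} bytes per character")
```

    A repeated placeholder string like the one above flatters any
    Lempel-Ziv method, so the figures only mean something when the sample
    is replaced by real running text of realistic length.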

    However, more advanced Lempel-Ziv-based algorithms are now widely deployed
    and used, including for communication over networks with limited bandwidth,
    and for storage (for faster retrieval and fewer misses in data caches).

    Given that computing speed is progressing faster than the technological
    limits (or the price) on data access time over communication channels, and
    that general-purpose compression algorithms are highly optimized and readily
    available on many platforms, why would we need a simplified compression
    algorithm?

    The real challenge in the computing industry is not so much storage
    space as communications, which must be secured. In a good design, an
    application should be layered, with the processing steps kept separate
    from the communication steps and their interfaces.

    The only reason I can see for needing a very simple algorithm is to
    integrate it within another algorithm in a non-layered development
    approach. Even if the compression algorithm is very simple, it will
    still obscure the way the rest of the application, which uses the
    decompressed data, is implemented. And that is a prime source of bugs, of
    multiple implementations within the same application, of corrections
    forgotten later, and of difficulties in deploying or maintaining complex
    systems.


