From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 09 2007 - 15:01:47 CDT
Doug Ewell wrote:
> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>
> > The algorithm given is clearly for compressing UTF-16 data. Look at
> > the sign test for three byte difference values. (It could be
> > adjusted/corrected to handle arbitrary codepoint differences.) I
> > wonder if SCSU would out-perform the algorithm on, say, Shavian.
>
> Shavian can be encoded extremely efficiently in SCSU: only one byte per
> character, plus three bytes of overhead (0B 60 08) at the start of the
> stream to set up a dynamic window, and another (01) to quote each U+00B7
> "namer dot." I doubt the simplified LZ method presented in UTN #31 can
> top this, but of course there's nothing like experimentation.
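Doug's three setup bytes can be checked with a minimal sketch. This is not an official SCSU implementation; it only mirrors the SDX (0x0B) "define extended window" rule from UTS #6, where the two bytes after the tag carry a 3-bit window index and a 13-bit offset code:

```python
# Sketch, assuming UTS #6 SDX semantics: tag 0x0B, then hbyte/lbyte,
# where window index = top 3 bits of hbyte and the remaining 13 bits
# select an offset in half-block (0x80) units above U+10000.
def sdx_window(hbyte, lbyte):
    index = hbyte >> 5
    offset = 0x10000 + ((((hbyte & 0x1F) << 8) | lbyte) * 0x80)
    return index, offset

index, offset = sdx_window(0x60, 0x08)
print(index, hex(offset))  # window 3 at U+10400

# Shavian (U+10450..U+1047F) fits entirely in [offset, offset+0x7F],
# so every Shavian letter is emitted as one byte 0x80 + (cp - offset).
assert all(offset <= cp <= offset + 0x7F for cp in range(0x10450, 0x10480))
print(hex(0x80 + (0x10450 - offset)))  # first Shavian letter as a single byte
```

So "0B 60 08" does define a dynamic window at U+10400 covering all of Shavian, giving one byte per character after the three-byte setup.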
No, I doubt that SCSU will outperform algorithms based on Lempel-Ziv, even a
simplified version. Remember the effect of detecting previous matches: a
match encodes a sequence of arbitrary length (a complete word, an
expression, or a common word prefix or suffix together with the surrounding
punctuation) in very few bytes, so you'll get less than one byte per
character in typical texts, such as Latin.
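To illustrate the point with a stock LZ-family compressor (zlib's DEFLATE, standing in for any Lempel-Ziv variant), repeated words in ordinary text compress to well under one byte per character, a floor that plain SCSU cannot go below since it emits at least one byte per character:

```python
# Hedged illustration: DEFLATE exploits repeated matches, so typical
# text with recurring words costs far less than one byte per character.
import zlib

text = "the quick brown fox jumps over the lazy dog. " * 100
raw = text.encode("utf-8")
packed = zlib.compress(raw, level=9)
print(len(packed) / len(text))  # bytes per character, well below 1.0
assert len(packed) < len(text)  # fewer output bytes than input characters
```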
However, more advanced Lempel-Ziv-based algorithms are now widely deployed
and used, including for communication on networks with limited bandwidth and
for storage (for faster retrieval and fewer misses in data caches).
Given that computing speed improves faster than the technological limits (or
price) of data access time on communication channels, and that
general-purpose compression algorithms are highly optimized and readily
available on most platforms, why would we need a simplified compression
algorithm?
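As a sketch of how little a "readily available" general-purpose compressor demands of the application, here is a complete round trip through Python's bundled zlib module, with no custom algorithm involved (the sample bytes, a few Shavian letters plus the U+00B7 namer dot, are just illustrative):

```python
# The whole compression layer, using only the standard library:
# the application never needs its own simplified algorithm.
import zlib

message = "\U00010450\U00010451\U00010452 \u00B7".encode("utf-8")
assert zlib.decompress(zlib.compress(message)) == message
```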
The effective challenge in the computing industry is not so much storage
space as communications, which must be secured. In a good design, an
application should be layered, with clear interfaces between the processing
steps and the communication steps.
The only reason I see why one would need a very simple algorithm is to
integrate it within another algorithm in a non-layered development approach;
and even if the compression algorithm is very simple, it will still obscure
how the rest of the application, which uses the decompressed data, is
implemented. That is a prime source of bugs, of multiple implementations
within the same application, of corrections forgotten later, and of
difficulties in deploying or maintaining complex systems.
This archive was generated by hypermail 2.1.5 : Wed May 09 2007 - 15:03:30 CDT