RE: Compression through normalization

From: Philippe Verdy (
Date: Tue Nov 25 2003 - 18:55:41 EST

    Doug Ewell writes:
    > Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
    > GP compression scheme. Atkin and Stansifer's paper from last year is
    > all about that, and I spend a few pages on it in my paper as well. You
    > can also re-Zip a Zip file, though, so I don't know what that proves
    > about the compression formats.

    Compressors are characterized by their ability to recompress their own
    output if needed. But you can't recompress the output of SCSU or BOCU-1,
    simply because their output is not a stream of code points but a stream of
    bytes. The best you can do is regenerate the code points, but this would
    mean decompressing and recompressing. There's no point in doing so with
    SCSU and BOCU-1, as there's no guarantee that the result of your
    de/re-compression will be better or worse than, or even fully identical
    to, the initial compressed form...
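
    To make the type mismatch concrete, here is a minimal Python sketch. It
    assumes zlib as the general-purpose compressor and uses UTF-8 as a
    stand-in for a code-point-oriented encoder, because Python's standard
    library has no SCSU or BOCU-1 codec. UTF-8 is not a compressor; it only
    shares the "code points in, bytes out" shape. The helper names
    encode_code_points and decode_code_points are hypothetical.

```python
import zlib

text = "Compression through normalization. " * 20


def encode_code_points(s: str) -> bytes:
    """Stand-in for SCSU/BOCU-1: takes code points (str), returns bytes.

    Assumption: UTF-8 is used only because it has the same input and
    output types; it is NOT SCSU or BOCU-1.
    """
    if not isinstance(s, str):
        raise TypeError("input must be a sequence of code points (str)")
    return s.encode("utf-8")


def decode_code_points(b: bytes) -> str:
    """Inverse of the stand-in encoder: bytes back to code points."""
    return b.decode("utf-8")


encoded = encode_code_points(text)

# A byte-oriented compressor accepts its own output as input again.
once = zlib.compress(encoded)
twice = zlib.compress(once)
restored = zlib.decompress(zlib.decompress(twice))

# The code-point-oriented encoder does not: bytes are not code points.
try:
    encode_code_points(encoded)
    reencode_ok = True
except TypeError:
    reencode_ok = False

# The only way back in is to decode first, then encode again.
roundtrip = encode_code_points(decode_code_points(encoded))
```

    The stand-in is deterministic, so the decode/re-encode round trip is
    byte-identical here; the point made above is that no such guarantee is
    assumed for the real schemes.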

    So the SCSU and BOCU-* formats are NOT general-purpose compressors. As they
    are defined only in terms of streams of Unicode code points, they are
    assumed to follow the conformance clauses of Unicode. As they recognize
    their input as Unicode text, they can recognize canonical equivalence, and
    this creates an opportunity for them to consider whether a
    (de)normalization or de/re-composition would result in higher compression.
    (Interestingly, the composition exclusions could be reconsidered in the
    case of BOCU-1 and SCSU compressed streams, provided that decompression to
    code points redecomposes the excluded compositions.)
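
    As a rough illustration of that opportunity, the sketch below builds two
    canonically equivalent forms of the same text (NFC and NFD), measures each
    one, and keeps the smaller. It assumes UTF-8 followed by zlib as a
    stand-in for the real encoder, since Python's standard library has no SCSU
    or BOCU-1 codec; the sample text and the compressed_size helper are
    hypothetical. It also shows one composition exclusion, U+0958, which NFC
    leaves decomposed.

```python
import unicodedata
import zlib

# Sample text with precomposed accented letters (written as escapes).
text = "V\u00e9rit\u00e9 \u00e9vidente: la compression d\u00e9pend de la forme. " * 10

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

# Both forms are canonically equivalent: each normalizes to the other.
equivalent = (
    unicodedata.normalize("NFC", nfd) == nfc
    and unicodedata.normalize("NFD", nfc) == nfd
)


def compressed_size(s: str) -> int:
    """Size in bytes after UTF-8 + zlib (a stand-in, not SCSU/BOCU-1)."""
    return len(zlib.compress(s.encode("utf-8")))


# A compressor that is aware of canonical equivalence could try each
# candidate form and keep whichever one encodes smallest.
candidates = {"NFC": nfc, "NFD": nfd}
sizes = {name: compressed_size(s) for name, s in candidates.items()}
best = min(sizes, key=sizes.get)

# Composition exclusion: U+0958 decomposes to U+0915 U+093C, and NFC
# does not recompose it.
excluded = "\u0958"
nfc_of_excluded = unicodedata.normalize("NFC", excluded)
```

    Which form wins depends on the text and on the encoder, so the sketch
    measures rather than assumes. A decoder that must return a particular
    normalization form would have to renormalize after decompression.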

    This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 19:43:53 EST