RE: Compression through normalization

From: jon@hackcraft.net
Date: Wed Nov 26 2003 - 07:37:24 EST

    > In the case of GIF versus JPG, which are usually regarded as "lossless"
    > versus "lossy", please note that there /is/ no "original", in the sense
    > of a stream of bytes. Why not? Because an image is not a stream of
    > bytes. Period. What is being compressed here is a rectangular array of
    > pixels, and that is what is being restored when the image is "viewed". I
    > am not aware of ANY use of the GIF format to compress an arbitrary byte
    > stream.
    >
    > So, by analogy, if the XYZ compression format (I made that up) claims to
    > compress a sequence of Unicode glyphs, as opposed to an arbitrary byte
    > stream, and can later reconstruct that sequence of glyphs exactly, then
    > I argue that it has every right to be called "lossless", in the same
    > manner that GIF is called "lossless", because /there is no original byte
    > stream to preserve/.

    Well there *is* a stream of bytes with GIFs and they *are* reconstructed
    perfectly on decompression. Most of the time it only matters that the image
    isn't altered by the compression (PNG is perhaps a better analogy, since with
    GIF we might be forced to reduce the colour depth of the image to make it work
    with the format; with PNG we can have better compression if we drop to 256 or
    fewer colours, but we don't have to). However, it could be an issue if we
    performed some operation on the underlying data which treated it as bytes
    (signing the image in BMP format springs to mind as a possibility).

    While in practice we would generally not have such issues (we would move our
    signing operation to after the PNG creation), they could arise if we have some
    concept of signing an image independent of image format, implemented by
    converting to a canonical format as needed. In such a case PNG, being lossless
    in its treatment of the bytestream, would be usable; JPEG, being lossy, would
    not.
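
    To make the image-signing case concrete, here is a rough sketch in Python.
    Everything in it is an assumption for illustration: the "canonical format"
    is a made-up layout (width, height, raw RGB bytes), a SHA-256 digest stands
    in for a real signature, zlib's DEFLATE stands in for PNG, and throwing away
    the low bits of each channel stands in for JPEG.

```python
import hashlib
import zlib


def canonical_bytes(width, height, pixels):
    """Serialise pixels into a hypothetical canonical form: a small header
    (width and height as 4-byte big-endian integers) followed by raw RGB."""
    header = width.to_bytes(4, "big") + height.to_bytes(4, "big")
    return header + bytes(channel for pixel in pixels for channel in pixel)


def digest(data):
    """SHA-256 digest, used here only as a stand-in for a signature."""
    return hashlib.sha256(data).hexdigest()


pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (17, 34, 51)]
original = canonical_bytes(2, 2, pixels)

# Lossless at the byte level: the DEFLATE round trip returns the same bytes.
restored = zlib.decompress(zlib.compress(original))

# A crude lossy step (not real JPEG): drop the low 4 bits of every channel.
lossy_pixels = [tuple(channel & 0xF0 for channel in pixel) for pixel in pixels]
lossy = canonical_bytes(2, 2, lossy_pixels)

still_verifies = digest(restored) == digest(original)
lossy_verifies = digest(lossy) == digest(original)
print(still_verifies, lossy_verifies)
```

    The digest survives the lossless round trip and does not survive the lossy
    one, which is all the "usable"/"not usable" distinction amounts to here.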

    With a similar operation on Unicode text data we have a similar problem and a
    similar solution. If we have a need to use the underlying bytes (let's say we're
    signing again) we can either move the operation on the bytes until after the
    compression or, if that is not permitted by some requirement, we are forced to
    use a compression scheme that is lossless at the byte level.
    If our concept of signing is independent of encoding then we can move between
    encodings during the compression process (and sign on a canonical encoding). XML
    Signature is an example of this (it treats UTF-8 as a canonical encoding).
    If our concept of signing considers canonically equivalent sequences to be
    equivalent we can move between normalisation forms in the compression process
    and sign and verify on a specified normalisation form. Again, XML Signature
    alludes to this possibility, though it doesn't use it, as it could introduce
    security issues in some cases; for applications that truly treat canonically
    equivalent sequences as equivalent, though, this is a viable pre-processing
    step to XML Signature.
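
    The three levels of "sameness" above can be sketched in a few lines of
    Python. This is only an illustration, not XML Signature itself: a SHA-256
    digest stands in for the signature, and the function names are made up.
    UTF-8 as the canonical encoding and NFC as the specified normalisation form
    are choices made for the example.

```python
import hashlib
import unicodedata


def digest(data: bytes) -> str:
    """SHA-256 digest, a stand-in for a real signature."""
    return hashlib.sha256(data).hexdigest()


def sign_bytes(raw: bytes) -> str:
    """Byte-level signing: any change of encoding breaks it."""
    return digest(raw)


def sign_encoding_independent(raw: bytes, encoding: str) -> str:
    """Sign on a canonical encoding (UTF-8 here), whatever the input was."""
    return digest(raw.decode(encoding).encode("utf-8"))


def sign_normalised(raw: bytes, encoding: str, form: str = "NFC") -> str:
    """Sign on a canonical encoding *and* a specified normalisation form."""
    text = unicodedata.normalize(form, raw.decode(encoding))
    return digest(text.encode("utf-8"))


composed = "caf\u00e9"      # e-acute as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

utf8 = composed.encode("utf-8")
utf16 = composed.encode("utf-16-le")
decomposed_utf8 = decomposed.encode("utf-8")

# Same text, different encodings: only the byte-level signature differs.
print(sign_bytes(utf8) == sign_bytes(utf16))
print(sign_encoding_independent(utf8, "utf-8")
      == sign_encoding_independent(utf16, "utf-16-le"))

# Canonically equivalent text: only the normalising signature agrees.
print(sign_encoding_independent(utf8, "utf-8")
      == sign_encoding_independent(decomposed_utf8, "utf-8"))
print(sign_normalised(utf8, "utf-8") == sign_normalised(decomposed_utf8, "utf-8"))
```

    Which of these three functions is "the" signature decides how much freedom
    the compressor has: none, freedom to re-encode, or freedom to renormalise.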

    So perhaps we should stop talking about "lossy/lossless" and talk about "what
    is lost" in a given operation. The advantage gained (theoretically, at least;
    does anyone have data on how significant this is?) comes from removing entropy of
    a type that the compression algorithm is unlikely to be able to remove itself.
    The question is whether this is truly entropy, or if it's actually data. I'd
    lean towards considering it entropy and removing it - but I'd like to be warned
    in advance that this was going to happen, and have other options available.
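
    For anyone wanting to gather that data, something like the following would
    be a starting point. It is a sketch under assumptions: zlib is just one
    general-purpose compressor, NFC is just one choice of form, and the sample
    text is invented (it deliberately mixes composed and decomposed spellings),
    so the numbers it prints say nothing about real corpora.

```python
import unicodedata
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the UTF-8 encoding after zlib at maximum level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compare(text: str, form: str = "NFC"):
    """Return (size as given, size after normalising to `form`)."""
    normalised = unicodedata.normalize(form, text)
    return compressed_size(text), compressed_size(normalised)


def loses_something(text: str, form: str = "NFC") -> bool:
    """True if normalising changes the code point sequence at all,
    i.e. if the caller ought to be warned before it happens."""
    return unicodedata.normalize(form, text) != text


# Invented sample: the same words spelled composed and decomposed.
sample = ("caf\u00e9 na\u00efve r\u00e9sum\u00e9 "
          "cafe\u0301 nai\u0308ve re\u0301sume\u0301 ") * 50

before, after = compare(sample)
print("as given:", before, "normalised:", after)
print("normalisation would alter the text:", loses_something(sample))
```

    The loses_something check is the "warn me in advance" part: a compressor
    could refuse, or fall back to a byte-exact mode, whenever it returns True.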



    This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 08:22:22 EST