RE: Compression through normalization

From: Philippe Verdy (
Date: Wed Nov 26 2003 - 09:29:09 EST

  • Next message: Philippe Verdy: "RE: Definitions"

    D. Starner writes:
    > > In the case of GIF versus JPG, which are usually regarded as "lossless"
    > > versus "lossy", please note that there /is/ no "original", in the sense
    > > of a stream of bytes. Why not? Because an image is not a stream of
    > > bytes. Period.
    > GIF isn't a compression scheme; it uses the LZW compression scheme, like
    > Unix compress, which is a byte-stream compressor. Also, if I take my
    > data, encode it as bytes and stick it into a GIF file with an arbitrary
    > palette, I can get back exactly that data. But if I encode my data as
    > 9-bit chunks and interpret those as Unicode code points (9 bits, because
    > 10 bits would get us undefined code points and 16 would get us surrogate
    > code points), and I emailed it to someone, and the mailer automatically
    > compressed it, I wouldn't consider it lossless if it wouldn't decompress
    > at the other side. And enough stuff in the real world will barf on
    > combining characters, or at least perform suboptimally, that changing
    > the normalization scheme could really cause problems.

    Many "ifs" for something that won't work in practice. Unicode text is not
    random binary data. You already restrict yourself to 9 bits because you
    must adapt to Unicode requirements. Why not also integrate the rules for
    canonical equivalence when you encapsulate a binary file in conforming
    Unicode text? I see no reason why you accept some limitations for this
    encapsulation, but not ALL the limitations.

    Look for example at the 9-bit code points: none of them have a distinct
    decomposition, recomposition, or compatibility equivalent within this
    restricted set. Suppose that a compressor or encoder chooses to compose
    them or make rearrangements; then all your binary de-encapsulator needs
    to do is recognize these canonical equivalents and perform its own
    normalization to get back to 9-bit codes. In your case, this means
    computing the NFC form, and magically, you will see that you have NOT lost
    any data. The binary content is then preserved and lossless, as there
    exists an algorithm that can restore ALL the original data.
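    To make this concrete, here is a minimal sketch in Python of such a
    9-bit encapsulator (the function names and the convention that the
    decoder knows the original byte length are my own assumptions, not
    anything from the thread). The decoder's NFC step is what makes the
    scheme survive canonical "denormalization" in transit: every code point
    below U+0200 is NFC-stable, so recomposing always restores the encoded
    chunks.

    ```python
    import unicodedata

    def encode_9bit(data: bytes) -> str:
        """Pack bytes into a bitstream and emit one character per 9-bit chunk
        (code points U+0000..U+01FF). Hypothetical encapsulation scheme."""
        bits, nbits, out = 0, 0, []
        for b in data:
            bits = (bits << 8) | b
            nbits += 8
            while nbits >= 9:
                nbits -= 9
                out.append(chr((bits >> nbits) & 0x1FF))
                bits &= (1 << nbits) - 1
        if nbits:
            # pad the final partial chunk with zero bits
            out.append(chr((bits << (9 - nbits)) & 0x1FF))
        return ''.join(out)

    def decode_9bit(text: str, length: int) -> bytes:
        """Recover the original bytes; `length` is assumed to be transmitted
        out of band. NFC first, to undo any canonical-equivalent rewriting
        (e.g. a transport that decomposed U+00E9 into U+0065 U+0301)."""
        text = unicodedata.normalize('NFC', text)
        bits, nbits, out = 0, 0, bytearray()
        for ch in text:
            bits = (bits << 9) | ord(ch)
            nbits += 9
            while nbits >= 8 and len(out) < length:
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)
                bits &= (1 << nbits) - 1
        return bytes(out)

    # Simulate a transport that denormalizes to NFD, then recover losslessly.
    data = bytes(range(256)) * 2
    mangled = unicodedata.normalize('NFD', encode_9bit(data))
    assert decode_9bit(mangled, len(data)) == data
    ```

    This works because the combining marks produced by decomposition (U+0300
    and above) lie outside the 9-bit set, and NFC recomposes them back into
    the precomposed 9-bit characters; no code point below U+0200 is a
    composition exclusion, so NFC is the identity on well-formed output of
    the encoder.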

    If you don't want such "denormalisation" to occur during compression,
    then don't claim that your 9-bit encapsulator produces Unicode text (so
    don't label it with a UTF-* encoding scheme, or even a BOCU-* or SCSU
    character encoding scheme; use your own charset label)!


    This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 10:22:19 EST