RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 26 2003 - 09:29:09 EST

Next message: Philippe Verdy: "RE: Definitions"

Previous message: Philippe Verdy: "RE: Compression through normalization"
In reply to: D. Starner: "RE: Compression through normalization"
Next in thread: D. Starner: "RE: Compression through normalization"

D. Starner writes:
> > In the case of GIF versus JPG, which are usually regarded as "lossless"
> > versus "lossy", please note that there /is/ no "orignal", in the sense
> > of a stream of bytes. Why not? Because an image is not a stream of
> > bytes. Period.
>
> GIF isn't a compression scheme; it uses the LZW compression scheme, like
> Unix compress, which is a stream of bytes compressor. Also, if I take my
> data and encoded it as bytes and stick it into a GIF file with an
> arbitrary palette, I can get back exactly that data. But if I encode my
> data as 9 bit chunks and interprete those as Unicode character points
> (9 bits, because 10 bits would get us undefined code points and 16 would
> get us surrogate code points), and I emailed it to someone, and the
> mailer automatically compressed it, I wouldn't consider it lossless if
> it wouldn't decompress at the other side. And enough stuff in the real
> world will barf on combining characters, or at least perform
> suboptimally, that changing the normalization scheme could really cause
> problems.

That is a lot of "ifs" for something that won't work in practice. Unicode
text is not random binary data. You already restrict yourself to 9 bits
because you must adapt to Unicode requirements. Why not also integrate the
rules for canonical equivalence into your encapsulation of binary data in
conforming Unicode text? I see no reason to accept some of the limitations
of this encapsulation but not ALL of them.

Look, for example, at the 9-bit code points (U+0000 to U+01FF): none of them
has a canonical decomposition or recomposition that stays within this
restricted set, so no two distinct sequences of them are canonically
equivalent. Suppose that a compressor or encoder chooses to decompose them
or to make rearrangements; then all your binary de-encapsulator needs to
do is recognize these canonical equivalents and perform its own
normalization to get back to the 9-bit codes. In your case, this means
computing the NFC form, and magically, you will see that you have NOT lost
any data. The binary content is then preserved and the scheme is lossless,
as there does exist an algorithm that can restore ALL the original data.
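
To make this concrete, here is a rough sketch in Python. The packing scheme
and the names pack9 and unpack9 are my own assumptions for illustration (the
original proposal did not specify a bit layout); only unicodedata.normalize
is a real library call. It packs bytes into 9-bit code points, lets an
intermediary rewrite the text in NFD, and then recovers the bytes by
renormalizing to NFC before unpacking.

```python
import unicodedata


def pack9(data: bytes) -> str:
    """Hypothetical packing: cut the bit stream into 9-bit chunks,
    one code point (U+0000..U+01FF) per chunk, zero-padded at the end."""
    nbits = 8 * len(data)
    pad = (-nbits) % 9                      # bits needed to reach a multiple of 9
    bits = int.from_bytes(data, "big") << pad
    count = (nbits + pad) // 9
    return "".join(chr((bits >> (9 * i)) & 0x1FF)
                   for i in reversed(range(count)))


def unpack9(text: str, length: int) -> bytes:
    """Inverse of pack9. The byte length is passed separately because
    the padding alone does not always identify it."""
    text = unicodedata.normalize("NFC", text)   # undo any canonical rewriting
    bits = 0
    for ch in text:
        cp = ord(ch)
        if cp > 0x1FF:
            raise ValueError("code point outside the 9-bit range")
        bits = (bits << 9) | cp
    pad = 9 * len(text) - 8 * length
    return (bits >> pad).to_bytes(length, "big")


payload = b"\x60\x00"                       # first 9 bits are 0x0C0 (U+00C0)
text = pack9(payload)
mangled = unicodedata.normalize("NFD", text)    # what an intermediary might do
restored = unpack9(mangled, len(payload))
```

Here the intermediary turns U+00C0 into "A" followed by U+0300, which is
outside the 9-bit set, and the NFC step in unpack9 brings it back. This only
covers canonical rewriting; an intermediary that applied compatibility
mappings (NFKC/NFKD) would be a different matter.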

If you don't want such "denormalisation" to occur during compression,
don't claim that your 9-bit encapsulator produces Unicode text (so don't
label it with a UTF-* encoding scheme, or even a BOCU-* or SCSU character
encoding scheme; use your own charset label instead)!



This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 10:22:19 EST