From: Asmus Freytag (firstname.lastname@example.org)
Date: Thu Sep 21 2006 - 01:13:36 CDT
On 9/20/2006 10:01 PM, Doug Ewell wrote:
> Hans Aberg <haberg at math dot su dot se> wrote:
>> Relative to that stuff, I suggest compressing the character data, as
>> represented by the code points, rather than any character-encoded
>> data. Typically, a compression method builds a binary encoding based
>> on a statistical analysis of a sequence of data units. So applied to
>> character data, such a compression yields a character encoding.
>> Conversely, any character encoding can be viewed as a compression
>> method with certain statistical properties.
> Different compression methods work in different ways. Certainly, a
> compression method that is specifically designed for Unicode text can
> take advantage of the unique properties of Unicode text, as compared
> to, say, photographic images.
> I've often suspected that a Huffman or arithmetic encoder that encoded
> Unicode code points directly would perform better than a byte-based
> one working with UTF-8 code units. I haven't done the math to prove
> it, though.
You missed attending the very *first* Unicode Implementers Workshop,
where one of the presentations did precisely that kind of math. If you
assume a large alphabet, then your compression gets worse, even if the
actual number of elements in use is small. SCSU and similar algorithms
reduce the effective alphabet to roughly 128 symbols, which is much less
than 0x10FFFF+1.
There is information in *where* in a large alphabet your subset resides.
If you don't model that well, or don't use a scheme that happens to
model this situation well, you pay a price of a few fractional bits per
character - almost a full bit if you use an 8-bit Huffman coder on
16-bit character data.
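[Editorial note: the price of a byte-level model on non-ASCII text is easy to see with a quick zero-order entropy estimate. This is only a sketch, not the workshop's actual math; the sample text and helper name are my own.]

```python
import math
from collections import Counter

def zero_order_entropy(symbols):
    """Empirical zero-order (memoryless) entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Greek sample: every letter takes two bytes in UTF-8.
text = "και το μικρον δειγμα αρκει"

# Model A: symbols are Unicode code points.
bits_per_char_cp = zero_order_entropy(text)

# Model B: symbols are UTF-8 code units (bytes); convert the per-byte
# entropy into a cost per original character for comparison.
utf8 = text.encode("utf-8")
bits_per_char_utf8 = zero_order_entropy(utf8) * len(utf8) / len(text)

print(bits_per_char_cp, bits_per_char_utf8)
```

On samples like this the byte model costs noticeably more bits per character than the code-point model, because a memoryless byte coder cannot exploit the dependency between a UTF-8 lead byte and its continuation bytes.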
>> When compressing character-encoded data, one first translates it into
>> character data, and compresses that. It then does not matter which
>> character encoding was originally used in the input, as the character
>> data will be the same: the final compressed form need only include
>> the additional information about what the original character encoding
>> was, so that the data can be restored.
> Actually, it does matter for some compression methods, such as the
> well-known LZW. Burrows-Wheeler is fairly unusual in this regard.
> Doug Ewell
> Fullerton, California, USA
> RFC 4645 * UTN #14
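[Editorial note: Doug's point that the byte encoding matters for byte-oriented compressors can be seen empirically. A small sketch, using zlib's DEFLATE (an LZ77-family coder, standing in here for LZW) on the same character data in two encodings; the sample text is my own.]

```python
import zlib

# The same character data in two different character encodings.
text = "Different compression methods work in different ways. " * 20
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# A byte-oriented LZ compressor sees two different byte streams and
# produces two different compressed outputs for the same characters.
c8 = zlib.compress(utf8_bytes, 9)
c16 = zlib.compress(utf16_bytes, 9)
print(len(c8), len(c16))
```

Both outputs round-trip back to the same character data, but the compressor's behavior (and output size) depends on which byte serialization it was fed, which is exactly why encoding-agnostic compression has to record the original encoding separately.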
This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 01:19:00 CDT