From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 00:01:46 CDT
Hans Aberg <haberg at math dot su dot se> wrote:
> Relative to that stuff, I suggest compressing the character data, as 
> represented by the code points, rather than any character-encoded data. 
> Typically, a compression method builds a binary encoding based on a 
> statistical analysis of a sequence of data units. So if it is applied 
> to the character data, the result is a character encoding derived from 
> the compression. Conversely, any character encoding can be viewed as a 
> compression method with certain statistical properties.
Different compression methods work in different ways.  Certainly, a 
compression method that is designed specifically for Unicode text can 
take advantage of its unique properties, as compared with, say, 
photographic images.
I've often suspected that a Huffman or arithmetic encoder that encoded 
Unicode code points directly would perform better than a byte-based one 
working with UTF-8 code units.  I haven't done the math to prove it, 
though.
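One rough way to check, at least for Huffman: build the code once with 
code points as the symbols and once with UTF-8 bytes as the symbols, and 
compare the coded sizes.  The Python sketch below is only illustrative 
(the sample string is made up, and the cost of transmitting the code 
table, which is larger for the code-point alphabet, is ignored), not a 
proof either way:

    import heapq
    from collections import Counter

    def huffman_total_bits(symbols):
        # Total bits a Huffman code needs for the given symbol sequence,
        # computed as the sum of the merged (internal-node) weights.
        freq = Counter(symbols)
        if len(freq) < 2:
            return len(symbols)      # degenerate case: one distinct symbol
        heap = list(freq.values())
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            a = heapq.heappop(heap)
            b = heapq.heappop(heap)
            total += a + b
            heapq.heappush(heap, a + b)
        return total

    # Hypothetical sample; mostly non-ASCII text makes the contrast visible.
    sample = "Пример текста в кириллице для проверки. " * 200
    print("code-point symbols:", huffman_total_bits(list(sample)), "bits")
    print("UTF-8 byte symbols:", huffman_total_bits(sample.encode("utf-8")), "bits")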
> When compressing character-encoded data, one first translates it into 
> character data and compresses that. It then does not matter which 
> character encoding was originally used in the input, as the character 
> data will be the same: the compressed output need only include the 
> additional information about which character encoding was originally 
> used, so that the data can be restored.
Actually, it does matter for some compression methods, such as the 
well-known LZW.  Burrows-Wheeler is fairly unusual in this regard.
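To make the LZW point concrete, here is a toy byte-oriented LZW coder 
(Python; the sample string is invented) applied to the same text in two 
encodings.  Because the dictionary grows from the exact sequence of 
input units it sees, UTF-8 and UTF-16 inputs produce different 
dictionaries and a different number of output codes:

    def lzw_codes(data):
        # Byte-oriented LZW: dictionary entries are byte strings built
        # from the exact input sequence, so the choice of encoding
        # changes both the dictionary and the output length.
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        w = b""
        out = []
        for byte in data:
            wc = w + bytes([byte])
            if wc in dictionary:
                w = wc
            else:
                out.append(dictionary[w])
                dictionary[wc] = next_code
                next_code += 1
                w = bytes([byte])
        if w:
            out.append(dictionary[w])
        return out

    text = "naïve café déjà vu " * 100   # made-up sample with non-ASCII characters
    print("UTF-8 input:   ", len(lzw_codes(text.encode("utf-8"))), "codes")
    print("UTF-16LE input:", len(lzw_codes(text.encode("utf-16-le"))), "codes")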
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645 * UTN #14