From: Hans Aberg (haberg@math.su.se)
Date: Wed Sep 20 2006 - 12:53:14 CDT
On 20 Sep 2006, at 04:14, Doug Ewell wrote:
> Hans Aberg <haberg at math dot su dot se> wrote:
>
>> It is probably more efficient to translate the stream into code  
>> points and then use a compression technique on that, because then  
>> the full character structure is taken into account. Then it does  
>> not matter which character encoding is used.
>
> If you have not yet read Unicode Technical Note #14, particularly  
> the sections on "general-purpose compression" and "two-layer  
> compression," you might wish to do so.
Relative to that, I suggest compressing the character data, as
represented by the code points, rather than any character-encoded data.
Typically, a compression method builds a binary encoding based on a
statistical analysis of a sequence of data units. So if applied to
the character data, such a compression yields a character encoding.
Conversely, any character encoding can be viewed as a
compression method with certain statistical properties.
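To illustrate, here is a minimal sketch in Python that builds a Huffman code directly over the code points of a text, so the result is in effect a variable-length character encoding tuned to that text. Huffman coding is only one possible choice of statistical method, and the function names (huffman_code, encode, decode) are made up for this example.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {character: bit string} built from code point frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {char: partial code}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 / 1.
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, code):
    """Encode text as a string of '0'/'1' characters (for readability)."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Invert encode(); works because Huffman codes are prefix-free."""
    inverse = {v: k for k, v in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)
```

The code table depends only on the sequence of code points, not on whether the text arrived as UTF-8, UTF-16 or Latin-1.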
When compressing character-encoded data, one first translates it into
character data, and compresses that. It then does not matter which
character encoding was originally used in the input, as the character
data will be the same: the final compressed output need only include the
additional information about what the original character encoding was,
in order to restore the data.
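A rough sketch of that scheme in Python follows. It makes several assumptions not in the text above: the container layout (one length byte, the encoding name, then the payload) is invented for the example, zlib stands in for a compressor working on code points, and UTF-8 is used only as a fixed serialization of the code points to feed to zlib. The names pack and unpack are hypothetical.

```python
import zlib


def pack(raw: bytes, encoding: str) -> bytes:
    """Decode raw bytes to characters, compress them, and record the encoding."""
    text = raw.decode(encoding)          # character-encoded data -> character data
    name = encoding.encode("ascii")
    if len(name) > 255:
        raise ValueError("encoding name too long for the one-byte length field")
    # Assumption: UTF-8 is only a fixed intermediate form for the compressor.
    payload = zlib.compress(text.encode("utf-8"))
    return bytes([len(name)]) + name + payload


def unpack(blob: bytes):
    """Return (original bytes, original encoding name)."""
    n = blob[0]
    encoding = blob[1:1 + n].decode("ascii")
    text = zlib.decompress(blob[1 + n:]).decode("utf-8")
    return text.encode(encoding), encoding
```

With this layout, the same text packed from Latin-1 and from UTF-16LE input gives identical payloads; only the small header differs. Exact byte-for-byte restoration assumes the input was in the canonical form that the codec itself produces (for instance, no optional byte order mark).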
There is the problem of large translation tables. But that belongs to
the chapter of table compression, or alternatively, one can use a set
of character encodings that, though not providing the most efficient
compression, may admit compact translation functions. On the other
hand, a translation table of just a hundred thousand characters is
not so big anymore on today's computers.
And one can go further, doing a statistical analysis of typical text
in the different languages, identifying words and their typical
statistical frequencies. A compressor would then identify common
words, suitable for compression, and give each of them one entry in the
translation table.
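As a small sketch of the word-table idea in Python: frequent words in a sample are each assigned one entry, represented here by a single code point from the Private Use Area (U+E000 to U+F8FF). Using the Private Use Area for the entries, the limit of 256 entries, and the names build_table, compress_words and expand_words are all assumptions made for the example; the input is assumed to contain no Private Use Area characters of its own.

```python
import re
from collections import Counter

PUA_START = 0xE000                 # first Private Use Area code point
PUA_END = 0xF8FF                   # last Private Use Area code point
PUA_SIZE = PUA_END - PUA_START + 1


def build_table(sample, max_entries=256):
    """Map each word occurring more than once in sample to one PUA character."""
    counts = Counter(w for w in re.findall(r"\w+", sample) if len(w) > 1)
    limit = min(max_entries, PUA_SIZE)
    common = [w for w, n in counts.most_common(limit) if n > 1]
    return {w: chr(PUA_START + i) for i, w in enumerate(common)}


def compress_words(text, table):
    """Replace every word that has a table entry by its single-character token."""
    if any(PUA_START <= ord(ch) <= PUA_END for ch in text):
        raise ValueError("input already uses Private Use Area characters")
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)


def expand_words(text, table):
    """Invert compress_words() using the same table."""
    inverse = {token: word for word, token in table.items()}
    return "".join(inverse.get(ch, ch) for ch in text)
```

In practice the table would be built in advance from typical text in a given language rather than from the message itself, and the output could then be fed to a statistical coder over the enlarged alphabet.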
   Hans Aberg