Recently, John Bennett said:
>> I had assumed that traditional compression algorithms looked for repeats
>> on an 8-bit basis and, hence, would fail to compress Unicode.  Is this
>> assumption correct/incorrect?
>
>Compression algorithms do work on an 8-bit basis, but looking at Unicode
>text as a sequence of bytes will still find a lot of pattern.  It just
>doesn't do as good a job as it would if it dealt with 16-bit chunks.
Actually, in a single-language document (say, one in an ISO Latin-1
language), the upper byte of almost every 16-bit code would be identical.
In English, for example, the upper byte would be all zeros and the lower
byte would be equivalent to ASCII.  Thus I would expect current text
compressors to do an excellent job on Unicode, since the 8-bit pattern in
the upper byte would have a very high frequency; both the statistical
modeling and the dictionary methods now in use should be able to exploit
this.  But if I understand what John said, a Unicode-specific algorithm
could do better.  I'd be most surprised if it did a whole lot better.
Even in a two-language document the same argument applies: each language's
characters cluster in a few Unicode blocks, so only a handful of
upper-byte values ever appear, and each of them is still highly frequent.
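
For the curious, this claim is easy to check empirically.  Here is a
minimal sketch (mine, not from the original discussion) that uses Python's
zlib, a byte-oriented dictionary compressor, as a stand-in for the
"current text compressors" above; the sample text is illustrative:

    import zlib

    text = "The quick brown fox jumps over the lazy dog. " * 200

    ascii_bytes = text.encode("ascii")      # one byte per character
    utf16_bytes = text.encode("utf-16-le")  # two bytes per char; high byte
                                            # is 0x00 for ASCII-range text

    for label, data in (("ASCII", ascii_bytes), ("UTF-16", utf16_bytes)):
        compressed = zlib.compress(data, 9)
        print(f"{label:7s} raw={len(data):6d}  "
              f"compressed={len(compressed):6d}  "
              f"ratio={len(compressed)/len(data):.3f}")

One would expect the UTF-16 input, though twice as large raw, to compress
to nearly the same absolute size as the ASCII input, since the constant
0x00 high bytes are pure pattern and cost the compressor almost nothing.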
Wayne Pollock,	pollock@acm.org