Re: Compression and Unicode [was: Name Compression]

From: Asmus Freytag (
Date: Fri May 12 2000 - 04:08:14 EDT

At 09:16 PM 5/10/00 -0800, Juliusz Chroboczek wrote:
>More generally, I get the impression that the Unicode community is
>particularly keen on inventing /ad hoc/ compression schemes. I still
>haven't heard a sound rationale for the existence of the SCCS. What's
>wrong with patent-free variants of LZW?

It's SCSU.

You apparently don't seem to realize that SCSU bridges the gap between
an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the
extra redundancy that is part of the endoding (sequences of every other
byte being null) and not a redundancy in the content. The output of SCSU
should be sent to LZW for block compression where that's desired.

To get the same effect with a generic algorithm, it would have to be
retargeted to 16-bit, losing effectiveness due to the larger alphabet size.
This was a topic already featured at the very earliest Unicode Implementers
Workshop in 1991, where Greg Hullender worked out the math in his slides
for the Huffman case to show how many extra bits the huffmann code would
need just because the alphabet was larger.

One of the key design points of SCSU was that it should work with small
strings. Starting a new LZW for each string is not efficient, it is
probably wasteful. SCSU usually does not need more than one or two bytes
overhead, and often 0 bytes to start up.

Another design point of SCSU is that it is editable (you can replace a
piece in the middle, w/o having to change the stuff at the beginning or the

Finally, and curiously, it was not so much the smallest strings the SCSU
designers wanted, but to get pretty much all types of Unicode encoded legacy
data to be as compact as in the equivalent legacy encoding.

Whether or not these things are important to your overall design, is a
different matter, but as long as they are, SCSU is superior to generic

I suspect that the Unicode nameslist has features that allow a better model
to be constructed over which the text then is arithmetically coded. Another
factor is probably (I didn't check this) the different ability to do semi
random accesses into the middle of compressed text. Adaptive Huffman codes
cannot be used, and you did not suggest that, but you would need to keep
bit indices to access substrings, whereas with the scheme presented (I
assume) you only need to store byte indeces.

In Torsten's scheme (I did check now) he supports separator bits, to trade
the space saved for an index for longer lookup time (scanning).

I wonder whether he has measured the difference in size from the effect of
dedicating one bit to the cause of SPACE vs HYPHEN. Not doing that would
double the number of words that can be represented in one byte, at the cost
of some duplication in the word table. Hard to know how many words really
show up in both forms.

His word table could be further compressed.

PS:) coders will be coders, they like to invent new coding schemes

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT