Re: Name Compression. Comparison and Tweaks

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Sat May 13 2000 - 15:30:42 EDT


KW> I know Juliusz looks askance at the creation of new, ad hoc,
KW> compression algorithms by the Unicode crowd.

No, I was just surprised that a number of /ad hoc/ encodings are
devised and never seem to be compared to standard methods. I think
that your analysis is symptomatic of this attitude, as you compare
three (four) different specific methods with no reference to standard
compression techniques.

In other words: there's nothing wrong with inventing a better wheel.
But please tell the world *why* it is better.

Anyway, I've written a quick hack to compute Huffman tables. Here are
the conclusions. Pro memoria, here are Torsten's results:

KW> Approach 2: Torsten Mohrin
KW> Claimed compression: 262,000 bytes ==> 59,000 bytes for the recoded
KW> names + 29,000 bytes for the code->word lookup table. Total size,
KW> around 88,000 bytes.

Caveat: there is a non-negligible likelihood that my program is buggy,
as I only wrote the part that computes the Huffman codes; I did not
write a decoder.

Caveat two: Huffman coding is not the best general-purpose compression
technique available. It is easy to implement, though.
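
For concreteness, here is a toy version of the construction, in
Python rather than my Lisp (an illustration of the method, not the
program I actually ran; the "\0" end-of-name marker is my choice of
notation):

    import heapq
    from collections import Counter

    def huffman_codes(freqs):
        # Min-heap of (weight, tiebreak, tree); a tree is either a
        # symbol (leaf) or a pair (left, right).  Repeatedly merge
        # the two lightest trees until a single tree remains.
        heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix
        walk(heap[0][2], "")
        return codes

    # Compressed size in bits is the sum of frequency * code length.
    freqs = Counter("LATIN CAPITAL LETTER A\0LATIN SMALL LETTER A\0")
    codes = huffman_codes(freqs)
    print(sum(freqs[s] * len(codes[s]) for s in freqs))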

I used roughly 270 KB of Unicode 3.0 names, totalling 9980 names
(after eliminating the numbered CJK characters and Braille).

Simply Huffman-encoding the table led to 1325741 bits, roughly 166 KB,
or 40% compression, which is surprisingly low for all-caps text (and
might point to a bug in my implementation). The Huffman table had 57
nodes; as a Huffman tree over n symbols has 2n-1 nodes, that makes 29
leaf symbols, presumably the 26 letters plus space, hyphen and an
end-of-name marker.

Using Torsten's idea of reencoding the table as a list of pointers
into a table of words (where the terminator ` ', `-' or end-of-name
is encoded as part of the word) leads to a compressed size of 392662
bits, roughly 49 KB. Unfortunately, as there are 4274 words, the
Huffman table has 9980 nodes. The longest codes are 16 bits long.
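
As I understand the reencoding, it amounts to something like this
(a sketch continuing the code above; the split_words helper and the
terminator details are my guesses):

    import re

    def split_words(name):
        # Fold each word's terminator into the word itself:
        # "ZERO WIDTH NON-JOINER" -> "ZERO ", "WIDTH ", "NON-", "JOINER\0"
        return re.findall(r"[^ -]+[ -]?", name + "\0")

    names = ["ZERO WIDTH NON-JOINER", "ZERO WIDTH JOINER"]  # stand-in data
    word_freqs = Counter(w for name in names for w in split_words(name))
    word_codes = huffman_codes(word_freqs)
    name_bits = sum(word_freqs[w] * len(word_codes[w]) for w in word_freqs)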

Finally, Huffman-encoding the list of words compresses it to 117578
bits, roughly 15 KB (that's an average of 27.5 bits per word). The
Huffman table, again, has 57 nodes.
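
Continuing the sketch, that step is just the character-level coder
applied to the word list itself:

    word_list = sorted(word_freqs)         # each distinct word, once
    char_freqs = Counter(c for w in word_list for c in w)
    char_codes = huffman_codes(char_freqs)
    word_list_bits = sum(char_freqs[c] * len(char_codes[c])
                         for c in char_freqs)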

The Huffman codes are self-terminating (no code word is a prefix of
another), so there is no need to encode the length of names or of
words. On the other hand, in an actual implementation you may want to
byte-align both names and words, which will lead to some padding
overhead.
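
Decoding a prefix code is just a matter of accumulating bits until
they form a complete code word (again a toy sketch, not a real
implementation):

    def decode(bits, codes):
        # Prefix codes are self-terminating: accumulate bits until
        # they match a complete code word, emit the symbol, repeat.
        by_code = {code: sym for sym, code in codes.items()}
        out, current = [], ""
        for b in bits:
            current += b
            if current in by_code:
                out.append(by_code[current])
                current = ""
        return out

    # Byte-aligning an entry rounds its bit length up to the next
    # multiple of 8; a 27-bit entry, say, occupies 32 bits.
    def padded(bits):
        return (bits + 7) // 8 * 8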

Conclusion: unless there is a clever way of encoding large binary
trees that I don't know about, it is probably not worthwhile to
Huffman-encode the list of names. It is clearly worthwhile to apply
standard compression techniques to the list of words.

Feel free to contact me if you want a copy of the Huffman tables, or
of the code that generated them (in Common Lisp).

Sincerely,

                                        Juliusz Chroboczek


