# RE: UTN #31 and direct compression of code points

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 07 2007 - 01:27:47 CDT


Doug Ewell wrote:
> Sent: Monday, May 7, 2007 03:29
> To: Unicode Mailing List
> Subject: UTN #31 and direct compression of code points
>
> I just discovered UTN #31, "Fast Compression Algorithm for Unicode
> Text," which describes a simplified Lempel-Ziv-type approach that is
> applied directly to "16 bit Unicode characters," unlike most LZ and
> other compression algorithms which operate on bytes.
>
> Although "16 bit" still isn't quite right -- there's more to Unicode
> than the BMP -- the approach of compressing Unicode code points directly
> was especially interesting to me, because I've wondered for a long time
> about the statement in the Unicode FAQ on this very topic:
>
> "To get the same effect [as SCSU] with one of the popular general
> purpose algorithms, like Huffman or any of the variants of Lempel-Ziv
> compression, it would have to be retargeted to 16-bit, losing
> effectiveness due to the larger alphabet size. It's relatively easy to
> work out the math for the Huffman case to show how many extra bits the
> compressed text would need just because the alphabet was larger.
> Similar effects exist for LZW."
(...)

First, the document refers several times to "16-bit Unicode characters"; this
would be better phrased as "16-bit Unicode code units". Likewise, "16-bit
units" should read "16-bit code units". This matters little here, because the
text remains valid after this minor change.

The rationale for the choice of 16-bit code units lies in the nature of the
match codes, i.e. their average size and how efficiently they can be
represented. Compressing 32-bit units would require re-encoding the match
codes in larger bit fields, and would grow the lookup dictionary to an
unreasonable, unbounded size.
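To make the bit-field point concrete, here is a minimal sketch of packing an
LZ back-reference (offset, length) into a single 16-bit code unit. The 12/4
bit split and the length bias are hypothetical choices for illustration, not
the format used by UTN #31:

```python
# Hypothetical layout: 12-bit offset, 4-bit length (biased by the 2-unit
# minimum match), so one match fits in a single 16-bit unit. A 32-bit
# alphabet would force wider fields and a larger dictionary.
def pack_match(offset: int, length: int) -> int:
    """Pack a back-reference into one 16-bit unit."""
    assert 0 <= offset < (1 << 12), "offset must fit in 12 bits"
    assert 2 <= length < 18, "length is stored biased by 2 in 4 bits"
    return (offset << 4) | (length - 2)

def unpack_match(unit: int) -> tuple[int, int]:
    """Recover (offset, length) from a packed 16-bit unit."""
    return unit >> 4, (unit & 0xF) + 2
```

With this layout the longest representable match is 17 units and the farthest
back-reference is 4095 units, which bounds the dictionary size by construction.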

You seem to forget that such Huffman coding requires storing not only the bit
streams representing each compressed code point, but also the table needed to
decode them. On a large alphabet like Unicode's, this conversion table has a
very significant size, unless you use a predetermined table based on a fixed
distribution; but then the compression is only effective on texts whose code
point frequencies match that fixed distribution.
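The table-overhead point can be illustrated with a small Huffman sketch: the
decoder needs one table entry per distinct symbol, so over a byte alphabet the
table is bounded at 256 entries, while over the full Unicode alphabet a text
using thousands of distinct code points ships a table of thousands of entries.
This is a toy implementation for illustration, computing only code lengths:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length} for a symbol->frequency map."""
    # Heap entries: (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees deepens every leaf in both by one bit.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "this is an example of huffman coding on a small alphabet"
freqs = Counter(text)
lengths = huffman_code_lengths(freqs)

# Bits for the payload alone -- the decode table is extra, and it needs
# at least one entry per distinct symbol seen in the text.
total_bits = sum(lengths[s] * f for s, f in freqs.items())
table_entries = len(freqs)
```

The payload compresses well, but `table_entries` grows with the number of
distinct symbols, which is exactly where a huge alphabet hurts.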

Huffman coding is effective, for example, if you know the source language and
script of most of the texts you have to compress; but for texts with different
distributions, the Huffman-coded output may be MUCH larger.

By contrast, the SCSU algorithm, which Unicode adopted with a formal
definition for general use, does not depend on the choice of language or
script (at least for almost all modern languages written only with characters
that were within the assigned code point ranges when the SCSU algorithm was
finalized).

And it does not depend on first building a conversion table, or even on
gathering statistics from a significant part of the text to encode, so the
SCSU algorithm can be applied on the fly, i.e. without prior knowledge of the
text and in a single pass, with a very small encoder state (of just one code
point). In a few cases some look-ahead over the next code points can optimize
the result further, but experiments have shown that the gain is very modest
and not always worth the effort (in terms of code complexity and of encoding
throughput, measured in encoded megabytes per second on the medium where the
compressed byte stream will be transported or stored).
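The "single pass, tiny state" property can be sketched with a toy windowed
encoder in the spirit of SCSU. This is NOT conforming SCSU (the real algorithm
has defined tags, eight dynamic windows, and a Unicode mode); it only
illustrates how one current-window base, carried as the sole encoder state,
lets runs within a 128-code-point block cost one byte each:

```python
SWITCH = 0xFF  # hypothetical escape byte announcing a new window base

def toy_encode(text: str) -> bytes:
    """One-pass toy windowed encoder; state is a single window base."""
    out = []
    base = 0x0000  # entire encoder state
    for cp in map(ord, text):
        assert cp <= 0xFFFF, "toy example restricted to the BMP"
        if base <= cp < base + 0x80:
            out.append(cp - base)  # inside the window: one byte
        else:
            base = cp & ~0x7F      # switch window, then emit the byte
            out.extend([SWITCH, base >> 8, base & 0xFF, cp - base])
    return bytes(out)

def toy_decode(data: bytes) -> str:
    """Deterministic single-pass decoder with the same one-value state."""
    out, base = [], 0x0000
    it = iter(data)
    for b in it:
        if b == SWITCH:
            base = (next(it) << 8) | next(it)
            b = next(it)
        out.append(chr(base + b))
    return "".join(out)
```

ASCII text passes through at one byte per character, and a run of Cyrillic
pays the window switch once, then one byte per character, mirroring how SCSU
achieves near byte-per-character rates on small-alphabet scripts.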

SCSU has another advantage: it is fully deterministic, and decoding is no
more complex whether or not the encoded stream was optimized; the worst case
has a very small, strictly bounded overhead compared to an uncompressed
stream. This makes SCSU even more interesting for permanent storage of large
volumes of text, to increase the effective throughput of the decoded stream
on retrieval, and to reduce the average access time thanks to a higher hit
rate in data caches (for example in large relational databases).

The decoding algorithm is also so simple that its cost is insignificant in
such a setting: performance is dominated not by the decoding algorithm (which
can be implemented even on a device with a small CPU of very modest
performance), but by I/O throughput, access time, locality, and cacheability,
all of which compression with SCSU significantly improves.

But now consider this LZ-like algorithm. Remember that UTN #31 is just a
technical note, published for information purposes; it is provided mostly as
a call for comments and improvements. It describes some principles in a
public document with copyright assigned to the author and to Unicode,
possibly as proof of prior art to protect the principle against an aggressive
patent claim on what could later become an approved algorithm, and possibly
as a response to potential patent claims on BOCU-1 or SCSU.

In fact, the conclusion of the document should compare the (still unnamed)
LZ-like algorithm against SCSU and BOCU-1. I consider it a viable
alternative, because its decompression algorithm is also fully deterministic
and very straightforward to implement, as demonstrated in the sample code.

This archive was generated by hypermail 2.1.5 : Mon May 07 2007 - 01:31:16 CDT