Compression
Q: I need to compress Unicode data. Is
there anything special to consider?
A: Unicode has defined a
Standard Compression
Scheme for Unicode (SCSU). While its use is not required, it has
certain key characteristics that you should know about because they make
it attractive for certain tasks. [AF]
Q: Why not use UTF-8 as compressed format?
A: UTF-8 represents only the ASCII characters in less space
than needed in UTF-16, for some characters (other European and Middle
Eastern) it requires the same space as UTF-16; and for all other
characters (South, Southeast, and East Asian) it requires more space.
(Both UTF-8 and UTF-16 always require less space than UTF-32.)
[AF]
Q: What's wrong with using standard
compression algorithms such as Huffman coding or patent-free variants of
LZW?
A: SCSU bridges the gap between an 8-bit based LZW and a
16-bit encoded Unicode text, by removing the extra redundancy that is
part of the encoding (sequences of every other byte being the same) and
not a redundancy in the content. The output of SCSU should be sent to
LZW for block compression where that is desired.
To get the same effect with one of the popular general
purpose algorithms, like Huffman or any of the variants of Lempel-Ziv
compression, it would have to be retargeted to 16-bit, losing
effectiveness due to the larger alphabet size. It's relatively easy to
work out the math for the Huffman case to show how many extra bits the
compressed text would need just because the alphabet was larger. Similar
effects exist for LZW. For a detailed discussion of general text
compression issues see the book Text Compression by Bell, Cleary
and Witten (Prentice Hall 1990).
Q: What are the design points for SCSU?
A: One of the key design points of SCSU was that it should
work with small strings. Starting a new LZW for each string is not
efficient; it is probably wasteful. SCSU usually does not need more than
one or two bytes overhead, and often 0 bytes to start up.
Another design point of SCSU is that it is editable (you
can replace a piece in the middle, w/o having to change the stuff at the
beginning or the end.)
Finally, it was not so much the smallest strings the SCSU
designers wanted, but to get pretty much all types of Unicode encoded
legacy data to be as compact as in the equivalent legacy encoding.
Whether or not these capabilities are important to your
overall design, is a different matter, but as long as they are, SCSU is
superior to generic algorithms. [AF]
Q: What about compressing longer texts?
A: The best way to compress large Unicode-encoded text is
via a Burrows-Wheeler type compression, which gives near identical
results no matter which encoding form was used. Even using SCSU first
does not significantly improve the overall compression, underscoring the
encoding form insensitivity of the algorithm.
For a description of the Burrows-Wheeler compression
algorithm (which is, for example, used in bzip2), see Data
Compression by David Salomon (Springer 2000).
Other general purpose compression algorithms, such as
Huffman, or LZ77 (found, for example, in GNU gzip) are sensitive to the
encoding forms used, and can benefit from using SCSU first.
Whether to use bzip2 or SCSU + gzip depends on availability
plus consideration of running time vs. final compression achieved. It
may be necessary to conduct a comparison with actual data to find the
true optimum. [AF]
Q: Are there disadvantages to using SCSU?
A: Unlike some other schemes, strings compressed with SCSU
cannot be compared binarily for equality of contents. That is because
the encoder has the choice of compression strategies, and different
encoders may make different choices for the same string. While you could
compare strings for equality if they are compressed by the same encoder,
the comparison order in case of strings of different contents will not
be the same as the binary comparison order for the original strings (in
the general case). [AF]
Q: Are there security concerns with SCSU?
A: Because identical strings can have different compressed
representations, filtering of compressed strings for insecure contents
can fail. [AF]