Q: I need to compress Unicode data. Is
there anything special to consider?
A: Unicode text is often stored and usually exchanged in UTF-8,
which is compact for ASCII-heavy markup (HTML, XML, JSON, etc.).
General-purpose compression algorithms often use much less than one byte per character.
Whether or not to use a compression algorithm depends on the proportion of text,
compared with non-text data, such as images
(which may already be compressed or be compressible),
and on the cost of the storage and transmission of the data.
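
As a rough illustration of the "less than one byte per character" claim, the following sketch measures compressed bytes per code point using zlib from the Python standard library. The helper name `bytes_per_character` and the sample string are assumptions made for this example, and the artificially repeated sample compresses far better than typical text would.

```python
import zlib

def bytes_per_character(text: str, level: int = 9) -> float:
    """Return zlib-compressed UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must not be empty")
    compressed = zlib.compress(text.encode("utf-8"), level)
    return len(compressed) / len(text)

# Hypothetical sample: markup-heavy text, repeated so the effect is visible.
sample = '<li class="item">\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac, \u0440\u0443\u0441\u0441\u043a\u0438\u0439</li>\n' * 200
ratio = bytes_per_character(sample)
print(f"{ratio:.3f} compressed bytes per character")
```

Running the same measurement on your own data gives a better basis for deciding whether compression is worth its cost.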
Q: Why not use UTF-8 as a compressed format?
A: UTF-8 is the default encoding for text on the internet.
It is reasonably compact, simple, and universally supported;
protocols like HTTP also offer additional compression (e.g., gzip).
However, all non-ASCII characters are represented in UTF-8 using more than one byte per character,
and all CJK characters require at least three.
If an application would benefit from compact or compressed text,
then UTF-8 is not optimal.
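
The byte counts mentioned above can be checked directly. This is a small sketch; the helper name `utf8_length` and the choice of sample characters are only illustrative.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

samples = {
    "ASCII letter": "a",                       # U+0061
    "Latin letter with accent": "\u00e9",      # U+00E9
    "Cyrillic letter": "\u044f",               # U+044F
    "CJK ideograph": "\u4e2d",                 # U+4E2D
    "CJK ideograph, plane 2": "\U00020000",    # U+20000
}

for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```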
Q: What is SCSU?
A: Unicode has defined a
Standard Compression Scheme for Unicode (SCSU).
It is a compact encoding that stores most text with one byte per character, or two for CJK.
Q: What are the design points for SCSU?
A: One of the key design points of SCSU was that it should
work with small strings. Starting a new general-purpose compression for each string
is probably wasteful. SCSU usually needs no more than
one or two bytes of overhead to start up, and often none at all.
Furthermore, the SCSU designers were not aiming for the smallest
possible strings so much as for making most types of Unicode-encoded
data as compact as in the equivalent legacy encoding.
For example, most simple scripts require a single byte per character in SCSU,
and CJK requires two bytes per character.
Whether or not these capabilities are important to your
overall design is a different matter, but if they are, SCSU is
superior to generic algorithms. [AF]
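
To see why restarting a general-purpose compressor for every short string is wasteful, the sketch below measures how many bytes zlib adds to a few short strings. It uses zlib only as one example of a general-purpose algorithm; the helper name `zlib_overhead` and the sample strings are assumptions of this example, and no SCSU encoder is involved.

```python
import zlib

def zlib_overhead(text: str) -> int:
    """Return compressed size minus UTF-8 size; positive means the data grew."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) - len(raw)

# Short strings with little internal repetition.
for s in ("OK", "\u041f\u0440\u0438\u0432\u0435\u0442", "\u3053\u3093\u306b\u3061\u306f"):
    print(f"{s!r}: {len(s.encode('utf-8'))} UTF-8 bytes, overhead {zlib_overhead(s):+d}")
```

The zlib container alone contributes a 2-byte header and a 4-byte checksum, which already exceeds the one or two bytes of start-up cost cited for SCSU.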
Q: What about compressing longer texts?
A: The best way to compress long strings of Unicode-encoded text is
via general-purpose compression,
which is an option in HTTP and other protocols.
Some compression algorithms are sensitive to the input encoding,
and using SCSU first may help to minimize the resulting size;
other algorithms give near identical results no matter which encoding form was used.
For details see
Unicode Technical Note #14
“A Survey of Unicode Compression” and
“Does Size Really Matter?”.
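
One way to check how sensitive a given algorithm is to the input encoding is to compress the same text in several encoding forms and compare the sizes. The sketch below does this with zlib and bz2 from the Python standard library; the helper name `compressed_sizes` and the sample text are assumptions of this example, and it makes no claim about which combination wins for real data.

```python
import bz2
import zlib

def compressed_sizes(text: str, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return raw, zlib, and bz2 sizes in bytes for each encoding form."""
    sizes = {}
    for enc in encodings:
        raw = text.encode(enc)
        sizes[enc] = {
            "raw": len(raw),
            "zlib": len(zlib.compress(raw, 9)),
            "bz2": len(bz2.compress(raw, 9)),
        }
    return sizes

# Hypothetical sample; substitute a representative document of your own.
text = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8" * 100
for enc, row in compressed_sizes(text).items():
    print(enc, row)
```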
Q: Are there disadvantages to using SCSU?
A: Unlike some other schemes, strings compressed with SCSU
cannot be binary compared for equality of contents. That is because
the encoder has the choice of compression strategies, and different
encoders may make different choices for the same string. While you could
compare strings for equality if they are compressed by the same encoder,
the comparison order in case of strings of different contents will not
be the same as the binary comparison order for the original strings (in
the general case). [AF]
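
The following sketch shows two different byte sequences that decode to the same string. It is a deliberately partial decoder, not a conforming SCSU implementation: it assumes the initial single-byte mode, the default position of dynamic window 0 at U+0080, and the SQU tag value 0x0E as described in UTS #6, and it rejects every other tag. The function name `decode_scsu_subset` is hypothetical.

```python
SQU = 0x0E               # "quote Unicode" tag: next two bytes are one UTF-16 code unit
WINDOW0_OFFSET = 0x0080  # assumed default offset of dynamic window 0

def decode_scsu_subset(data: bytes) -> str:
    """Decode a tiny subset of SCSU (initial single-byte mode only)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SQU:
            if i + 2 >= len(data):
                raise ValueError("truncated SQU sequence")
            unit = (data[i + 1] << 8) | data[i + 2]
            if 0xD800 <= unit <= 0xDFFF:
                raise ValueError("surrogates are not handled in this sketch")
            out.append(chr(unit))
            i += 3
        elif b >= 0x80:
            # Upper half of the byte range maps into the active dynamic window.
            out.append(chr(WINDOW0_OFFSET + (b - 0x80)))
            i += 1
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII and the pass-through control codes stand for themselves.
            out.append(chr(b))
            i += 1
        else:
            raise ValueError(f"tag 0x{b:02X} is not supported in this sketch")
    return "".join(out)

direct = b"caf\xe9"          # U+00E9 through dynamic window 0
quoted = b"caf\x0e\x00\xe9"  # U+00E9 quoted with SQU
print(direct == quoted)                                          # byte strings differ
print(decode_scsu_subset(direct) == decode_scsu_subset(quoted))  # decoded text matches
```

Because both sequences are valid choices for an encoder, equality has to be tested on the decoded text rather than on the compressed bytes.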
Q: Are there security concerns with SCSU?
A: Because identical strings can have different compressed
representations, filtering compressed strings for insecure content
can fail. [AF]
On the web, where encoding declarations are often incorrect,
the text encoding is often detected heuristically.
Encodings like SCSU and the obsolete UTF-7,
which use bytes 0x20..0x7E for the encoding of non-ASCII characters,
have been used to inject malicious code.
Therefore, these encodings must not be used in web documents
(W3C Choosing & applying a character encoding,
HTML5 document character encoding).
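
The risk can be sketched with Python's built-in UTF-7 codec: a payload made only of harmless-looking ASCII bytes passes a naive byte-level filter, yet decodes to markup once a consumer interprets it as UTF-7. The filter function `naive_filter_allows` and the payload are hypothetical and serve only to illustrate the mechanism.

```python
def naive_filter_allows(data: bytes) -> bool:
    """Hypothetical filter: accept input containing no angle brackets."""
    return b"<" not in data and b">" not in data

# "+ADw-" and "+AD4-" are the UTF-7 encodings of "<" and ">".
payload = b"+ADw-script+AD4-"

allowed = naive_filter_allows(payload)
decoded = payload.decode("utf-7")

print("filter allows payload:", allowed)
print("decoded as UTF-7:", decoded)
```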