RE: UTN #31 and direct compression of code points

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 07 2007 - 01:27:47 CDT


    Doug Ewell wrote:
    > Sent: Monday, May 7, 2007 03:29
    > To: Unicode Mailing List
    > Subject: UTN #31 and direct compression of code points
    >
    > I just discovered UTN #31, "Fast Compression Algorithm for Unicode
    > Text," which describes a simplified Lempel-Ziv-type approach that is
    > applied directly to "16 bit Unicode characters," unlike most LZ and
    > other compression algorithms which operate on bytes.
    >
    > Although "16 bit" still isn't quite right -- there's more to Unicode
    > than the BMP -- the approach of compressing Unicode code points directly
    > was especially interesting to me, because I've wondered for a long time
    > about the statement in the Unicode FAQ on this very topic:
    >
    > "To get the same effect [as SCSU] with one of the popular general
    > purpose algorithms, like Huffman or any of the variants of Lempel-Ziv
    > compression, it would have to be retargeted to 16-bit, losing
    > effectiveness due to the larger alphabet size. It's relatively easy to
    > work out the math for the Huffman case to show how many extra bits the
    > compressed text would need just because the alphabet was larger.
    > Similar effects exist for LZW."
    (...)

    First, I see references to the expression "16-bit Unicode characters" in
    the text of the document; these would better read "16-bit Unicode code
    units". I also see "16-bit units", which should likewise be read as "16-bit
    code units". This does not matter much here, because the text remains valid
    after this minor change.

    The rationale for the choice of 16-bit code units lies in the nature of the
    matches, i.e. their average size and how efficiently they can be
    represented. Compressing 32-bit units would require encoding the matches in
    larger bit-fields, and would also inflate the lookup dictionary to an
    unreasonable, effectively unbounded size.
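
    As a quick illustration of why the unit size matters (with made-up field
    widths, not the actual UTN #31 token layout), a back-reference can be packed
    into a single 16-bit unit; a 32-bit literal alphabet would force either
    wider match fields or multi-unit tokens, and a correspondingly larger
    dictionary:

    def pack_match(distance, length):
        """Pack a back-reference into one 16-bit unit (hypothetical 1+11+4 bit layout)."""
        assert 1 <= distance <= 2048 and 2 <= length <= 17
        return 0x8000 | ((distance - 1) << 4) | (length - 2)

    def unpack_match(unit):
        """Recover (distance, length); the top bit marks a match in this toy layout."""
        assert unit & 0x8000
        return ((unit >> 4) & 0x7FF) + 1, (unit & 0xF) + 2

    print(unpack_match(pack_match(300, 5)))   # -> (300, 5)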

    You seem to forget that such Huffman coding requires storing not only the
    bit-stream representing each compressed code point, but also the table
    needed to decode that bit-stream. On a large alphabet like Unicode, this
    table has a very significant size, unless you use a predetermined table
    based on a fixed distribution; but then the compression is only effective
    on texts whose code point frequencies match that fixed distribution.
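
    To make that overhead concrete, here is a small Python sketch (my own, not
    taken from the UTN) that computes Huffman code lengths for a short text and
    compares the payload with a rough estimate of the table that has to travel
    with it; the three bytes per table entry are just an assumption for
    illustration, and over the full Unicode alphabet each entry would also need
    a wider symbol field than over a byte alphabet:

    import heapq
    from collections import Counter

    def huffman_code_lengths(text):
        """Return {symbol: code length in bits} for the characters of `text`."""
        freq = Counter(text)
        if len(freq) == 1:                    # degenerate single-symbol text
            return {next(iter(freq)): 1}
        heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:                  # merge the two rarest subtrees
            n1, _, t1 = heapq.heappop(heap)
            n2, _, t2 = heapq.heappop(heap)
            merged = {s: d + 1 for s, d in {**t1, **t2}.items()}
            heapq.heappush(heap, (n1 + n2, tie, merged))
            tie += 1
        return heap[0][2]

    text = "Ceci n'est qu'un exemple assez court."
    lengths = huffman_code_lengths(text)
    payload_bits = sum(lengths[c] for c in text)
    table_bytes = 3 * len(lengths)            # assumed cost per (code point, length) entry
    print((payload_bits + 7) // 8, "payload bytes versus about", table_bytes, "bytes of table")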

    Huffman coding is effective, for example, if you know the source language
    and script of most of the texts you have to compress, but for texts with
    different distributions the Huffman-coded output may be MUCH larger.

    By contrast, the SCSU algorithm, which Unicode adopted with a formal
    definition for general use, does not depend on the choice of language and
    script (at least for almost all modern languages written only with
    characters that were already encoded when the SCSU algorithm was
    finalized).

    Nor does it depend on first building a conversion table, or even on
    gathering statistics from a significant part of the text to encode, so the
    SCSU algorithm can be applied "on the fly", i.e. without prior knowledge of
    the text and in a single pass, with a very small encoding state (of just
    one code point). In a few cases, looking ahead a few more code points may
    optimize the result further, but experiments have shown that the gain is
    very modest and not always worth the effort (in terms of code complexity
    and encoding throughput, measured in encoded megabytes per second on the
    medium where the compressed byte-stream will be transported or stored).
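
    To illustrate that single-pass principle (and only the principle: the toy
    scheme below is NOT conformant SCSU, which has eight dynamic windows,
    static windows, a Unicode mode and a different tag layout), here is a
    sketch in which the whole encoder state is one window base, updated as the
    text is scanned, with no look-ahead and no table to transmit:

    def toy_encode(text):
        window = 0x0080                        # the whole encoder state
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                out.append(cp)                 # Basic Latin passes through as one byte
            elif window <= cp < window + 0x7F:
                out.append(0x80 + cp - window) # one byte: offset inside the window
            else:
                window = cp                    # single-pass state update
                out.append(0xFF)               # invented "redefine window" tag
                out += window.to_bytes(3, "big")
                out.append(0x80)               # the character itself, at offset 0
        return bytes(out)

    def toy_decode(data):
        window, out, i = 0x0080, [], 0
        while i < len(data):
            b = data[i]
            if b == 0xFF:                      # window redefinition
                window = int.from_bytes(data[i + 1:i + 4], "big"); i += 4
            elif b < 0x80:
                out.append(chr(b)); i += 1
            else:
                out.append(chr(window + b - 0x80)); i += 1
        return "".join(out)

    s = "Пример сжатия текста"
    assert toy_decode(toy_encode(s)) == s
    print(len(toy_encode(s)), "bytes, versus", 2 * len(s), "in UTF-16")  # 24 versus 40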

    SCSU has another advantage: it is fully deterministic and no more complex
    to decode whether or not the encoded stream was optimized, and the worst
    case has a very small, strictly limited overhead compared to an
    uncompressed stream. This makes SCSU even more interesting for permanent
    storage of large volumes of text: it increases the effective throughput of
    the decoded stream when retrieving it, and reduces the average access time
    thanks to a higher hit rate in data caches (for example in large relational
    databases).

    The decoding algorithm is also so simple that its cost is insignificant in
    such a case: performance is not dominated by the decoding algorithm (which
    can be implemented even on a device with a small CPU of very modest
    performance), but by the I/O throughput, its access time, locality and
    cachability (all of which compression with SCSU significantly improves).

    But now let's look at this LZ-like algorithm. Remember that this is just a
    technical note, for information purposes, provided mostly as a call for
    comments and improvements. It describes some principles in a public
    document whose copyright is assigned to the author and to Unicode, possibly
    as proof of prior art to protect the principle against the development of
    an aggressive patent claim on what could later become an approved
    algorithm, and possibly as a response to potential patent claims on BOCU-1
    or SCSU.

    In fact, the conclusion of this document should consider comparing the
    (still unnamed) LZ-like algorithm against SCSU and BOCU-1. I consider it a
    viable alternative, because its decompression algorithm is also fully
    deterministic and very straightforward to implement, as demonstrated in the
    sample code.
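
    For readers who do not have the UTN at hand, the decoder really is just a
    loop copying literal runs and back-references. The record layout below is
    invented for illustration (the UTN defines its own token format), but the
    amount of work involved is representative:

    def lz16_decode(units):
        """Decode records of the form <literal count> <literals...> <length> <distance>."""
        out, i = [], 0
        while i < len(units):
            lit = units[i]; i += 1
            out.extend(units[i:i + lit]); i += lit          # copy the literal run
            if i >= len(units):
                break
            length, dist = units[i], units[i + 1]; i += 2   # back-reference
            for _ in range(length):                         # may overlap itself
                out.append(out[-dist])
        return out

    # tiny hand-made stream for "abcabcabcd", using 16-bit units as integers
    encoded = [3, 0x61, 0x62, 0x63, 6, 3, 1, 0x64]
    print("".join(map(chr, lz16_decode(encoded))))          # -> abcabcabcd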


