Re: UTN #31 and direct compression of code points

From: Doug Ewell (dewell@adelphia.net)
Date: Tue May 08 2007 - 01:20:11 CDT


    Philippe Verdy <verdy_p@wanadoo.fr> wrote:

    > You seem to forget that such Huffman coding requires storing not
    > only the bit-stream that represents the compressed code points, but
    > also the table that will be needed to decode that bit-stream.

    Since I just finished implementing a Huffman encoder and decoder, and
    since I wrote in my last message "except for the tree-storage overhead,
    which usually makes Huffman inappropriate for such short messages
    anyway," and also wrote in UTN #14 about storing the tree, it's probably
    safe to assume that I didn't just forget about storing the tree.

    In a conventional byte-oriented Huffman implementation, for each of N
    encoded symbols (including any "pseudo-EOF" character that is needed to
    tell the decoder when to stop), the tree typically requires 10N - 2 bits
    (8 bits for each terminal node plus 2 for each non-terminal node, of
    which there are N - 1). The total size of the tree obviously depends on
    the number of discrete symbols in the message. For a short but highly
    redundant message like "she sells seashells by the seashore" (entropy
    = 3.04 bits per symbol), the tree is small and the total encoded
    message, tree plus bitstream, is smaller than the plaintext. For a
    short pangram like "the quick brown fox jumps over the lazy dog"
    (entropy = 4.38 bits per symbol), the tree alone is larger than the
    plaintext.
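
    For anyone who wants to check those figures, here is a quick Python
    sketch (not my implementation, just the arithmetic) that computes the
    per-symbol entropy, the 10N - 2 tree estimate, and the entropy lower
    bound on the bitstream:

        from collections import Counter
        from math import ceil, log2

        def entropy(text):
            """Shannon entropy of the message, in bits per symbol."""
            total = len(text)
            return -sum(n / total * log2(n / total)
                        for n in Counter(text).values())

        def tree_bits(text):
            """10N - 2 bits: 8 per terminal node, 2 per non-terminal node
            (there are N - 1 of them), with N counting the distinct
            characters plus the pseudo-EOF symbol."""
            n = len(set(text)) + 1
            return 10 * n - 2

        for msg in ("she sells seashells by the seashore",
                    "the quick brown fox jumps over the lazy dog"):
            stream = ceil(entropy(msg) * len(msg))  # bitstream lower bound
            print("entropy %.2f | tree %d + stream >= %d bits | plaintext %d bits"
                  % (entropy(msg), tree_bits(msg), stream, 8 * len(msg)))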

    An implementation that allows the entire Unicode range to be encoded
    clearly requires more than 8 bits per code point. I used
    self-delimiting numeric values [1] to represent code points up to U+007F
    in 8 bits, up to U+3FFF in 16 bits, and the rest in 24 bits. This
    doesn't perform so well with short examples in Ugaritic or Syloti
    Nagri, but as the message gets longer, the tree overhead becomes
    proportionally smaller, because the number of discrete characters
    grows more slowly than the text itself and eventually hits a ceiling.
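
    The self-delimiting scheme itself is simple enough to sketch in a few
    lines of Python (this is just my reading of the SDNV draft in [1], not
    production code): each octet carries seven payload bits, with the high
    bit set on every octet except the last, which gives 8 bits through
    U+007F, 16 bits through U+3FFF, and 24 bits for everything else up to
    U+10FFFF.

        def sdnv_encode(cp):
            """Encode one code point as a self-delimiting numeric value:
            7 payload bits per octet, high bit set on all but the last."""
            out = [cp & 0x7F]                  # final octet, high bit clear
            cp >>= 7
            while cp:
                out.append((cp & 0x7F) | 0x80) # earlier octets, high bit set
                cp >>= 7
            return bytes(reversed(out))

        def sdnv_decode(data, pos=0):
            """Decode one value starting at pos; return (value, next_pos)."""
            value = 0
            while True:
                octet = data[pos]
                pos += 1
                value = (value << 7) | (octet & 0x7F)
                if not octet & 0x80:           # high bit clear: last octet
                    return value, pos

        # sdnv_encode(0x0041)  -> 1 octet, sdnv_encode(0x3FFF) -> 2 octets,
        # sdnv_encode(0x10400) -> 3 octets (all of Unicode fits in three).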

    It might be possible to improve on this storage scenario by using an
    SCSU-like sliding window or by encoding differences between symbols,
    if the order in which the tree will be read is known. In my
    experiments, this improved matters slightly for texts in small
    alphabets above U+4000, but made things worse for non-English
    Latin-based texts.
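
    To make the "differences between symbols" idea concrete, here is one
    possible sketch (reusing sdnv_encode from above, and assuming the
    decoder rebuilds the leaf table in sorted order):

        def delta_encode_leaves(code_points):
            """Hypothetical leaf-table layout: leaves sorted, first one
            stored absolutely, each later one as an SDNV-coded difference
            from its predecessor."""
            leaves = sorted(set(code_points))
            out = bytearray(sdnv_encode(leaves[0]))
            for prev, cur in zip(leaves, leaves[1:]):
                out += sdnv_encode(cur - prev)
            return bytes(out)

        # For a text confined to one small block above U+4000, most deltas
        # fit in a single octet instead of two per absolute code point.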

    > On a large alphabet like Unicode, this conversion table will have a
    > very significant size, unless you have a predetermined table, based on
    > fixed distributions, but then the compression will only be effective
    > on texts having the same code point frequencies as in the fixed
    > distribution.

    Predetermined tables are useful when the texts to be encoded are in the
    same language, and ideally of the same nature. For an experimental
    system such as mine, meant to show compression performance on any sort
    of Unicode text, it would clearly be the wrong approach.

    > Huffman coding is effective, for example, if you know the source
    > language and script of most of the texts you have to compress, but
    > if there are texts with different distributions, the Huffman-coded
    > output may be MUCH larger.

    Any type of compression becomes more effective when more can be assumed
    about the input.

    > By contrast, the SCSU algorithm, which was adopted by Unicode with a
    > formal definition for general use, does not depend on the choice of
    > language and script (at least for almost all modern languages
    > written only with characters that were within the assigned code
    > point ranges when the SCSU algorithm was finalized).

    You don't have to sell me on SCSU; I'm a true believer (as I wrote in my
    previous message). But one-size-fits-all compression solutions are hard
    to come by. Most commercial compression tools can fall back on a
    secondary algorithm, or even straight storage with no compression, if
    the primary algorithm is not effective. In the case of SCSU, its
    Achilles' heel is short-alphabet text spread across multiple
    128-character half-blocks, as in Vietnamese, Ethiopic, and Canadian
    Syllabics.

    > And it does not depend on first building a conversion table, or even
    > on gathering statistics from a significant part of the text to
    > encode, so the SCSU algorithm can be applied "on the fly", i.e.
    > without prior knowledge of the text and in a single pass, with a
    > very small encoding state (of just one code point). There may be a
    > few cases where looking ahead a few more code points may be useful
    > to optimize the result further, but experiments have shown that the
    > gain is very modest, and not always worth the effort (in terms of
    > code complexity and encoding throughput, measured in encoded
    > megabytes per second on the media where the compressed byte stream
    > will be transported or stored).

    I assume you're telling other people. I wrote about all this in UTN #14
    three years ago.

    > SCSU has another advantage: it is fully deterministic

    Not true. As you just said yourself, encoders that look ahead farther
    can achieve modestly better (i.e. different) results.

    > and it is no more complex to decode whether the encoded stream was
    > optimized or not, and the worst case has a very small and strictly
    > limited overhead compared to an uncompressed stream; this makes SCSU
    > even more interesting for permanent storage of large volumes of
    > text, to increase the effective throughput of the decoded stream
    > when retrieving it, and to reduce the average access time, due to a
    > higher hit rate in data caches (for example in large relational
    > databases).

    I agree fully about the ease of decompressing SCSU, and said so in UTN
    #14. (Microsoft didn't take my advice about supporting SCSU in Internet
    Explorer.)

    > But now we look at this LZ-like algorithm; remember that this is
    > just a technical note, for information purposes, provided mostly as
    > a call for comments and improvements; it just describes some
    > principles in a public document with a copyright assigned to the
    > author and to Unicode, possibly as proof of prior art to protect the
    > principle against the possible development of an aggressive patent
    > claim, for what could later become an approved algorithm

    There's obviously plenty of prior art surrounding LZ-type compression,
    especially for a simplified version such as this one.

    > and it is offered as a possible response to potential patent claims
    > on BOCU-1 or SCSU.

    SCSU does not belong in this discussion. It was derived from a simpler
    scheme (RCSU) developed at Reuters and its publication as a freely
    available UTR was led by two Reuters employees. It has been around for
    10 years now and nobody has attempted to claim any IPR on it or restrict
    its use. (BOCU-1, of course, is another matter.)

    > In fact, the conclusion of this document should consider comparing
    > the (still unnamed) LZ-like algorithm against SCSU and BOCU-1. I
    > consider it a viable alternative, because the decompression
    > algorithm is fully deterministic too, and very straightforward to
    > implement, as demonstrated in the sample code.

    I'd love to see it, and will probably conduct my own experiments with it
    as well. I have no great criticism to make against the UTN #31
    approach; as I said, I just discovered the UTN this past weekend.

    [1]
    ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-eddy-dtn-sdnv-02.txt

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

