UTN #31 and direct compression of code points

From: Doug Ewell (dewell@adelphia.net)
Date: Sun May 06 2007 - 20:28:53 CDT

    I just discovered UTN #31, "Fast Compression Algorithm for Unicode
    Text," which describes a simplified Lempel-Ziv-type approach applied
    directly to "16 bit Unicode characters," unlike most LZ and other
    compression algorithms, which operate on bytes.

    Although "16 bit" still isn't quite right -- there's more to Unicode
    than the BMP -- the approach of compressing Unicode code points directly
    was especially interesting to me, because I've wondered for a long time
    about the statement in the Unicode FAQ on this very topic:

    "To get the same effect [as SCSU] with one of the popular general
    purpose algorithms, like Huffman or any of the variants of Lempel-Ziv
    compression, it would have to be retargeted to 16-bit, losing
    effectiveness due to the larger alphabet size. It's relatively easy to
    work out the math for the Huffman case to show how many extra bits the
    compressed text would need just because the alphabet was larger.
    Similar effects exist for LZW."

    This puzzles me because in Huffman coding, the symbols (characters) that
    *don't* appear in the message to be encoded don't appear in the Huffman
    tree, and should have no impact on the encoded message. A message of,
    say, 50 discrete characters, whether drawn from the 128-character ASCII
    range or the million-character Unicode range, should have the same
    entropy. The representation of the character in the Huffman tree might
    be larger if the "alphabet" is larger, but this cost should be borne
    only *once*, in the tree, not every time the character is encoded.
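    The point is easy to check with a short sketch (Python here, purely
    illustrative and not tied to any particular encoder): the Huffman tree
    is built only from the symbols that actually occur, so shifting the
    same message into a larger alphabet leaves the encoded length unchanged.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Map each distinct symbol to its Huffman code length in bits.
    Only symbols that actually occur enter the tree."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate: one distinct symbol
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tiebreaker, {symbol: depth-so-far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encoded_bits(symbols):
    """Total encoded length, ignoring the cost of storing the tree."""
    freq = Counter(symbols)
    lengths = huffman_code_lengths(symbols)
    return sum(freq[s] * lengths[s] for s in freq)

ascii_msg = [ord(c) for c in "abracadabra"]             # symbols below U+0080
astral_msg = [ord(c) + 0x10000 for c in "abracadabra"]  # same message, beyond the BMP
assert encoded_bits(ascii_msg) == encoded_bits(astral_msg)  # 23 bits either way
```

    (The symbols themselves are bigger in the second message, but that
    cost is paid once, in the tree, not per occurrence.)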

    Apparently there was a presentation at the first Unicode Implementers'
    Workshop, back in the early '90s, where this claim was made, and it has
    been accepted ever since.

    I've implemented a Huffman encoder that operates directly on Unicode
    code points, and it does a noticeably better job of compressing the data
    than if it were applied to the UTF-8 or UTF-16 code units. Even in
    contrived cases, like using combinations of the same 4 UTF-8 code units
    to generate 16 different characters, the direct approach works better
    (except for the tree-storage overhead, which usually makes Huffman
    inappropriate for such short messages anyway). The reason is that the
    increase in per-symbol entropy is more than offset by the reduced
    number of symbols that must be encoded.
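    For anyone who wants to reproduce the effect without writing a full
    encoder, here is a rough sketch (my own, not the implementation
    described above) that compares Huffman-coded lengths when the symbols
    are code points versus UTF-8 code units; the example string is
    arbitrary.

```python
import heapq
from collections import Counter

def huffman_bits(symbols):
    """Optimal Huffman-coded length in bits (tree storage not counted).
    Uses the standard identity: total length = sum of all merge weights."""
    weights = list(Counter(symbols).values())
    if len(weights) == 1:
        return weights[0]            # one distinct symbol: 1 bit each
    heapq.heapify(weights)
    total = 0
    while len(weights) > 1:
        merged = heapq.heappop(weights) + heapq.heappop(weights)
        total += merged              # each merge adds one bit to every leaf below it
        heapq.heappush(weights, merged)
    return total

text = "Ελληνικά και русский текст"    # Greek and Cyrillic: 2 UTF-8 bytes per letter
cp_bits = huffman_bits(text)                    # symbols are code points
byte_bits = huffman_bits(text.encode("utf-8"))  # symbols are UTF-8 code units
assert cp_bits < byte_bits   # direct code-point coding wins on this input
```

    The byte-level coder has to spend bits on the highly repetitive UTF-8
    lead bytes, and it encodes roughly twice as many symbols for the same
    text.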

    I'd like to know if anyone has done similar experiments, and how they
    turned out, and if anyone is interested in more details of my
    implementation.

    None of this is meant as a criticism of SCSU, in which I am still a
    big believer.

    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
