Re: UTN #31 and direct compression of code points

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 07 2007 - 12:25:14 CDT


    Philippe Verdy wrote on Monday, May 07, 2007 7:27 AM
    Subject: RE: UTN #31 and direct compression of code points

    > The rationale for the choice of 16-bit code units is explained by the
    > nature of code matches, i.e. their average size, and how we can represent
    > them efficiently. Compressing 32-bit units would require re-encoding the
    > code matches on larger bit-fields, as well as increasing the size of the
    > lookup dictionary to an unreasonable, effectively unbounded, size.

    The size of the lookup dictionary is a user choice. The examples were run
    with a capacity of 2^14 character pairs - all one could need for
    compressing ASCII-only data, since there are only 128 x 128 = 2^14
    distinct pairs of ASCII characters. The argument for UTF-16 would be the
    more restricted one, that all the input and output have to be kept in
    memory. Obviously keeping everything in memory is just a simplification,
    but restricting the memory available would slow the program down.
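
    Purely to make that concrete, here is a small sketch in Python of a
    byte-pair-encoding style compressor with a bounded pair dictionary. It is
    not the algorithm of UTN #31, nor the one used for the examples mentioned
    above; the names compress, decompress and PAIR_CAPACITY are invented for
    illustration. It just shows that the pair table is bounded by a user
    choice, and that ASCII-only input can never need more than 2^14 entries.

        from collections import Counter

        PAIR_CAPACITY = 2 ** 14   # user-chosen table size (an assumption here)

        def compress(text, max_rounds=64):
            # Repeatedly replace the most frequent adjacent pair of symbols
            # with a fresh code, recording the pair in a bounded dictionary.
            symbols = [ord(c) for c in text]
            next_code = 0x110000    # codes above the Unicode range name pairs
            pair_table = {}         # new code -> (left symbol, right symbol)
            for _ in range(max_rounds):
                if len(pair_table) >= PAIR_CAPACITY:
                    break           # dictionary full: stop adding pairs
                counts = Counter(zip(symbols, symbols[1:]))
                if not counts:
                    break
                (a, b), n = counts.most_common(1)[0]
                if n < 2:
                    break           # nothing worth replacing
                pair_table[next_code] = (a, b)
                merged, i = [], 0
                while i < len(symbols):
                    if symbols[i:i + 2] == [a, b]:
                        merged.append(next_code)
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                symbols, next_code = merged, next_code + 1
            return symbols, pair_table

        def decompress(symbols, pair_table):
            # Expand pair codes recursively back into characters.
            out = []
            def expand(code):
                if code in pair_table:
                    left, right = pair_table[code]
                    expand(left)
                    expand(right)
                else:
                    out.append(chr(code))
            for s in symbols:
                expand(s)
            return ''.join(out)

    Round-tripping decompress(*compress(s)) returns the original string, and
    the pair table never grows beyond the chosen capacity, whatever that
    capacity is set to.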

    > You seem to forget that such Huffman coding requires not only storing
    > the bit-streams representing each compressed code point, but also the
    > table that will be needed to decode the bit-stream. On a large alphabet
    > like Unicode, this conversion table will have a very significant
    > size,...

    That entirely depends on how one stores the table. One need only store
    entries for the characters that actually occur in the text. One can also
    use a dynamic table that is updated automatically as the data is encoded
    or decoded, though that must be slower for decoding and will encode less
    efficiently.
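
    As a rough illustration of the first point (again Python, again not the
    coding scheme under discussion; huffman_code_lengths is an invented name):
    the table that has to be stored or transmitted has one entry per distinct
    character in the text, not one per Unicode code point.

        import heapq
        from collections import Counter

        def huffman_code_lengths(text):
            # Standard Huffman construction over the characters that actually
            # occur in the text; returns {character: code length in bits}.
            freq = Counter(text)
            if len(freq) == 1:                  # degenerate one-symbol case
                return {next(iter(freq)): 1}
            heap = [(n, i, [c]) for i, (c, n) in enumerate(freq.items())]
            heapq.heapify(heap)
            lengths = dict.fromkeys(freq, 0)
            tiebreak = len(heap)
            while len(heap) > 1:
                n1, _, group1 = heapq.heappop(heap)
                n2, _, group2 = heapq.heappop(heap)
                for c in group1 + group2:
                    lengths[c] += 1             # one level deeper in the tree
                heapq.heappush(heap, (n1 + n2, tiebreak, group1 + group2))
                tiebreak += 1
            return lengths

        table = huffman_code_lengths("direct compression of code points")
        print(len(table), "table entries, one per distinct character")

    A dynamic table of the kind mentioned above would instead update the
    frequencies, and hence the codes, as each symbol is processed; that
    variant is not shown here.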

    Richard.


