Re: UTN #31 and direct compression of code points

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 07 2007 - 12:25:14 CDT


    Philippe Verdy wrote on Monday, May 07, 2007 7:27 AM
    Subject: RE: UTN #31 and direct compression of code points

    > The rationale for the choice of 16-bit code units is explained by the
    > nature of code matches, i.e. their average size, and how we can represent
    > them efficiently. Compressing 32-bit units would require re-encoding the
    > code matches on larger bit-fields, as well as increasing the size of the
    > lookup dictionary to an unreasonable, effectively unbounded, size.

    The size of the lookup dictionary is a user choice. The examples were run
    with a capacity of 2^14 character pairs - all one could need for
    compressing ASCII-only data, since there are only 128 x 128 = 2^14
    distinct pairs of ASCII characters. The argument for UTF-16 would be the
    more restricted one, that all the input and output have to be kept in
    memory. Obviously keeping everything in memory is just a simplification,
    but restricting the memory available would slow the program down.
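
    Purely to make that concrete, here is a small sketch in Python of a
    byte-pair-encoding style compressor with a bounded pair dictionary. It is
    not the algorithm of UTN #31, nor the one used for the examples mentioned
    above; the names compress, decompress and PAIR_CAPACITY are invented for
    illustration. It just shows that the pair table is bounded by a user
    choice, and that ASCII-only input can never need more than 2^14 entries.

        from collections import Counter

        PAIR_CAPACITY = 2 ** 14   # user-chosen table size (an assumption here)

        def compress(text, max_rounds=64):
            # Repeatedly replace the most frequent adjacent pair of symbols
            # with a fresh code, recording the pair in a bounded dictionary.
            symbols = [ord(c) for c in text]
            next_code = 0x110000    # codes above the Unicode range name pairs
            pair_table = {}         # new code -> (left symbol, right symbol)
            for _ in range(max_rounds):
                if len(pair_table) >= PAIR_CAPACITY:
                    break           # dictionary full: stop adding pairs
                counts = Counter(zip(symbols, symbols[1:]))
                if not counts:
                    break
                (a, b), n = counts.most_common(1)[0]
                if n < 2:
                    break           # nothing worth replacing
                pair_table[next_code] = (a, b)
                merged, i = [], 0
                while i < len(symbols):
                    if symbols[i:i + 2] == [a, b]:
                        merged.append(next_code)
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                symbols, next_code = merged, next_code + 1
            return symbols, pair_table

        def decompress(symbols, pair_table):
            # Expand pair codes recursively back into characters.
            out = []
            def expand(code):
                if code in pair_table:
                    left, right = pair_table[code]
                    expand(left)
                    expand(right)
                else:
                    out.append(chr(code))
            for s in symbols:
                expand(s)
            return ''.join(out)

    Round-tripping decompress(*compress(s)) returns the original string, and
    the pair table never grows beyond the chosen capacity, whatever that
    capacity is set to.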

    > You seem to forget that such Huffman coding requires not only storing
    > the bit-streams representing each compressed code point, but also the
    > table that will be needed to decode the bit-stream. On a large alphabet
    > like Unicode, this conversion table will have a very significant
    > size,...

    That entirely depends on how one stores the table. One need only store
    entries for the characters that actually occur in the text. One can also
    use a dynamic table that is updated automatically as the data is encoded
    or decoded, though that must be slower for decoding and will encode less
    efficiently.
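
    As a rough illustration of the first point (again Python, again not the
    coding scheme under discussion; huffman_code_lengths is an invented name):
    the table that has to be stored or transmitted has one entry per distinct
    character in the text, not one per Unicode code point.

        import heapq
        from collections import Counter

        def huffman_code_lengths(text):
            # Standard Huffman construction over the characters that actually
            # occur in the text; returns {character: code length in bits}.
            freq = Counter(text)
            if len(freq) == 1:                  # degenerate one-symbol case
                return {next(iter(freq)): 1}
            heap = [(n, i, [c]) for i, (c, n) in enumerate(freq.items())]
            heapq.heapify(heap)
            lengths = dict.fromkeys(freq, 0)
            tiebreak = len(heap)
            while len(heap) > 1:
                n1, _, group1 = heapq.heappop(heap)
                n2, _, group2 = heapq.heappop(heap)
                for c in group1 + group2:
                    lengths[c] += 1             # one level deeper in the tree
                heapq.heappush(heap, (n1 + n2, tiebreak, group1 + group2))
                tiebreak += 1
            return lengths

        table = huffman_code_lengths("direct compression of code points")
        print(len(table), "table entries, one per distinct character")

    A dynamic table of the kind mentioned above would instead update the
    frequencies, and hence the codes, as each symbol is processed; that
    variant is not shown here.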

    Richard.


