Re: Unicode & space in programming & l10n

From: Doug Ewell
Date: Thu Sep 21 2006 - 07:26:40 CDT

    Hans Aberg <haberg at math dot su dot se> wrote:

    > One way to do a character compression is to simply do a frequency
    > analysis, sort the characters according to that, which gives a map
    > code points -> code points. Then apply a variable width character
    > encoding which gives smaller width to smaller non-negative integers,
    > like say UTF-8, to that. Here, the compression method cannot do worse
    > than UTF-8.

    You mean, do Huffman encoding, but with bytes as the basic code unit
    instead of bits?

    Don't forget you need to store the frequency table along with the
    compressed data, so the reader can reconstruct the mapping. That could
    offset your compression gains somewhat.
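
    To make the trade-off concrete, here is a minimal sketch of the scheme
    as I read it: rank the characters by frequency, remap each one to its
    rank, and encode the ranks as UTF-8. All function names are made up for
    illustration, and the table format (the ranked characters stored as
    plain UTF-8, with no length prefix) is only an assumption. It also
    assumes the input contains no lone surrogates.

```python
from collections import Counter

SURROGATE_START = 0xD800
SURROGATE_COUNT = 0x800  # U+D800..U+DFFF cannot be encoded in UTF-8


def rank_to_code_point(rank):
    """Map a frequency rank to a code point, skipping the surrogate range."""
    return rank if rank < SURROGATE_START else rank + SURROGATE_COUNT


def code_point_to_rank(cp):
    """Inverse of rank_to_code_point."""
    return cp if cp < SURROGATE_START else cp - SURROGATE_COUNT


def build_table(text):
    """Distinct characters, most frequent first (ties broken by code point)."""
    counts = Counter(text)
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


def compress(text):
    """Return (table, payload): payload is the rank-remapped text in UTF-8."""
    table = build_table(text)
    rank = {ch: i for i, ch in enumerate(table)}
    remapped = "".join(chr(rank_to_code_point(rank[ch])) for ch in text)
    return table, remapped.encode("utf-8")


def decompress(table, payload):
    """Decode the UTF-8 payload and map each rank back to its character."""
    return "".join(
        table[code_point_to_rank(ord(c))] for c in payload.decode("utf-8")
    )


def table_size(table):
    """Bytes needed to ship the table, assuming it is stored as plain UTF-8."""
    return len("".join(table).encode("utf-8"))


if __name__ == "__main__":
    for sample in ("привет, мир! " * 20, "привет"):
        table, payload = compress(sample)
        assert decompress(table, payload) == sample
        print(
            "utf-8:", len(sample.encode("utf-8")),
            "payload:", len(payload),
            "table:", table_size(table),
        )
```

    The payload alone is never longer than the plain UTF-8 text, since the
    most frequent characters get the smallest code points. Once the table is
    counted, though, a short message can come out larger than it went in.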

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14
