Re: Unicode & space in programming & l10n

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 00:01:46 CDT

    Hans Aberg <haberg at math dot su dot se> wrote:

    > Relative to that stuff, I suggest compressing the character data, as
    > represented by the code points, rather than any character-encoded
    > data. Typically, a compression method builds a binary encoding based
    > on a statistical analysis of a sequence of data units. So if applied
    > to the character data, a character encoding results from such a
    > compression. Conversely, any character encoding can be viewed as a
    > compression method with certain statistical properties.

    Different compression methods work in different ways. Certainly, a
    compression method that is specifically designed for Unicode text can
    take advantage of the unique properties of Unicode text, as compared to,
    say, photographic images.

    I've often suspected that a Huffman or arithmetic encoder that encoded
    Unicode code points directly would perform better than a byte-based one
    working with UTF-8 code units. I haven't done the math to prove it,
    though.
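
    A rough way to start checking that suspicion on sample text is
    sketched here. It assumes a plain order-0 (static) Huffman code and
    counts only the payload bits, not the cost of storing the code table,
    which will be larger for a code-point alphabet. The function names
    are made up for illustration; this is a sketch, not a proof.

```python
import heapq
from collections import Counter


def huffman_bits(symbols):
    """Total payload bits to Huffman-code a sequence (code table not counted)."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return len(symbols)
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        # Each merge adds one bit to every symbol beneath the new node.
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def compare(text):
    """Return (bits coding code points, bits coding UTF-8 code units)."""
    by_code_point = huffman_bits([ord(c) for c in text])
    by_utf8_unit = huffman_bits(text.encode("utf-8"))
    return by_code_point, by_utf8_unit
```

    On the tiny Greek sample "αβγα", compare() gives 6 bits for code
    points against 14 bits for UTF-8 code units. A real test would need
    realistic text and would have to account for the table overhead.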

    > When compressing character-encoded data, one first translates it into
    > character data, and compresses that. So it then does not matter which
    > character encoding was originally used in the input, as the character
    > data will be the same: the final compression need only include the
    > additional information about what the original character encoding
    > was, in order to restore the data.

    Actually, it does matter for some compression methods, such as the
    well-known LZW. Burrows-Wheeler is fairly unusual in this regard.
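
    To illustrate the LZW case, here is a minimal sketch of a textbook
    byte-oriented LZW pass that only counts the codes it would emit. It
    assumes the compressor sees the encoded bytes rather than decoded
    code points, uses an unbounded dictionary, and ignores code widths;
    the function names are hypothetical.

```python
def lzw_code_count(data: bytes) -> int:
    """Count the codes a textbook LZW pass would emit for a byte string."""
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    emitted = 0
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the current match.
            current = candidate
        else:
            # Emit the code for the longest match, then learn the new string.
            emitted += 1
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        emitted += 1  # flush the final match
    return emitted


def compare_encodings(text, encodings=("utf-8", "utf-16-le")):
    """Return the LZW code count for the same text under each encoding."""
    return {enc: lzw_code_count(text.encode(enc)) for enc in encodings}
```

    For the two-character text "ab", compare_encodings() reports 2 codes
    for UTF-8 and 4 for UTF-16LE, so the same character data produces
    different LZW output depending on the encoding the compressor is fed.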

    --
    Doug Ewell
    Fullerton, California, USA
    http://users.adelphia.net/~dewell/
    RFC 4645  *  UTN #14
    

