From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 00:01:46 CDT
Hans Aberg <haberg at math dot su dot se> wrote:
> Relative to that stuff, I suggest compressing the character data, as 
> represented by the code points, rather than any character-encoded data. 
> Typically, a compression method builds a binary encoding based on a 
> statistical analysis of a sequence of data units. So if it is applied 
> to the character data, the result is a character encoding derived from 
> the compression. Conversely, any character encoding can be viewed as a 
> compression method with certain statistical properties.
Different compression methods work in different ways.  Certainly, a 
compression method that is designed specifically for Unicode text can 
take advantage of its unique properties, as compared with, say, 
photographic images.
I've often suspected that a Huffman or arithmetic encoder that encoded 
Unicode code points directly would perform better than a byte-based one 
working with UTF-8 code units.  I haven't done the math to prove it, 
though.
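One rough way to check, at least for Huffman: build the code once with 
code points as the symbols and once with UTF-8 bytes as the symbols, and 
compare the coded sizes.  The Python sketch below is only illustrative 
(the sample string is made up, and the cost of transmitting the code 
table, which is larger for the code-point alphabet, is ignored), not a 
proof either way:

    import heapq
    from collections import Counter

    def huffman_total_bits(symbols):
        # Total bits a Huffman code needs for the given symbol sequence,
        # computed as the sum of the merged (internal-node) weights.
        freq = Counter(symbols)
        if len(freq) < 2:
            return len(symbols)      # degenerate case: one distinct symbol
        heap = list(freq.values())
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            a = heapq.heappop(heap)
            b = heapq.heappop(heap)
            total += a + b
            heapq.heappush(heap, a + b)
        return total

    # Hypothetical sample; mostly non-ASCII text makes the contrast visible.
    sample = "Пример текста в кириллице для проверки. " * 200
    print("code-point symbols:", huffman_total_bits(list(sample)), "bits")
    print("UTF-8 byte symbols:", huffman_total_bits(sample.encode("utf-8")), "bits")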
> When compressing character-encoded data, one first translates it into 
> character data and compresses that. It then does not matter which 
> character encoding was originally used in the input, as the character 
> data will be the same: the compressed output need only include the 
> additional information about which character encoding was originally 
> used, so that the data can be restored.
Actually, it does matter for some compression methods, such as the 
well-known LZW.  Burrows-Wheeler is fairly unusual in this regard.
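To make the LZW point concrete, here is a toy byte-oriented LZW coder 
(Python; the sample string is invented) applied to the same text in two 
encodings.  Because the dictionary grows from the exact sequence of 
input units it sees, UTF-8 and UTF-16 inputs produce different 
dictionaries and a different number of output codes:

    def lzw_codes(data):
        # Byte-oriented LZW: dictionary entries are byte strings built
        # from the exact input sequence, so the choice of encoding
        # changes both the dictionary and the output length.
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        w = b""
        out = []
        for byte in data:
            wc = w + bytes([byte])
            if wc in dictionary:
                w = wc
            else:
                out.append(dictionary[w])
                dictionary[wc] = next_code
                next_code += 1
                w = bytes([byte])
        if w:
            out.append(dictionary[w])
        return out

    text = "naïve café déjà vu " * 100   # made-up sample with non-ASCII characters
    print("UTF-8 input:   ", len(lzw_codes(text.encode("utf-8"))), "codes")
    print("UTF-16LE input:", len(lzw_codes(text.encode("utf-16-le"))), "codes")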
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645 * UTN #14