From: Mark Davis (email@example.com)
Date: Wed Sep 20 2006 - 16:25:53 CDT
I strongly suspect that all of that would give only minor advantages over
general-purpose algorithms like ZIP. But this is all academic -- I don't see
anyone taking the time and effort to investigate it in the absence of a
On 9/20/06, Hans Aberg <firstname.lastname@example.org> wrote:
> On 20 Sep 2006, at 04:14, Doug Ewell wrote:
> > Hans Aberg <haberg at math dot su dot se> wrote:
> >> It is probably more efficient to translate the stream into code
> >> points and then use a compression technique on that, because then
> >> the full character structure is taken into account. Then it does
> >> not matter which character encoding is used.
> > If you have not yet read Unicode Technical Note #14, particularly
> > the sections on "general-purpose compression" and "two-layer
> > compression," you might wish to do so.
> Relative to that stuff, I suggest compressing the character data, as
> represented by the code points, rather than any character-encoded data.
> Typically, a compression method builds a binary encoding based on a
> statistical analysis of a sequence of data units. So if it is applied
> to the character data, such a compression yields a character
> encoding. Conversely, any character encoding can be viewed as a
> compression method with certain statistical properties.
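
To illustrate the point above that a statistical compressor applied to code points amounts to a character encoding, here is a minimal sketch in Python. Huffman coding is chosen only as a familiar example of a statistics-based method; the thread does not name one. The function names `huffman_code` and `encode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text: str) -> dict[int, str]:
    """Map each code point in `text` to a prefix-free bit string."""
    freq = Counter(ord(ch) for ch in text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct code point still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so heapq never compares the dicts
    heap = [(n, next(tie), {cp: ""}) for cp, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, extending their bit strings.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {cp: "0" + bits for cp, bits in left.items()}
        merged.update({cp: "1" + bits for cp, bits in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(text: str, code: dict[int, str]) -> str:
    """Encode `text` as a string of '0'/'1' characters (for readability)."""
    return "".join(code[ord(ch)] for ch in text)
```

The resulting table is, in effect, a variable-length character encoding tuned to the sample: frequent code points get short bit strings, rare ones get long bit strings.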
> When compressing character-encoded data, one first translates it into
> character data and compresses that. It then does not matter which
> character encoding was originally used in the input, as the character
> data will be the same: the compressed output need only include the
> additional information about what the original character encoding
> was in order to restore the data.
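
The scheme in the paragraph above can be sketched roughly as follows. This is an illustration under assumptions, not anything proposed in the thread: the container layout (one length byte, an ASCII encoding label, then the payload), the use of UTF-32BE as the fixed form for the code points, and the names `pack` and `unpack` are all hypothetical, and zlib stands in for whatever compressor would actually be used.

```python
import zlib


def pack(data: bytes, encoding: str) -> bytes:
    """Decode `data` to code points, compress them, and record the encoding."""
    text = data.decode(encoding)
    # Serialise the code points in one fixed form so that the payload does
    # not depend on the input encoding (UTF-32BE is an assumption here).
    payload = zlib.compress(text.encode("utf-32-be"), 9)
    label = encoding.encode("ascii")
    if len(label) > 255:
        raise ValueError("encoding label too long for one length byte")
    return bytes([len(label)]) + label + payload


def unpack(blob: bytes) -> bytes:
    """Restore the original bytes in the original character encoding."""
    n = blob[0]
    encoding = blob[1:1 + n].decode("ascii")
    text = zlib.decompress(blob[1 + n:]).decode("utf-32-be")
    return text.encode(encoding)
```

With this layout, the same text supplied as UTF-8 or as UTF-16LE produces the same compressed payload; only the label differs. The round trip is byte-exact only when the input is the plain encoding of its text, for example without a byte order mark that the decoder would drop.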
> There is the problem of large translation tables. But that belongs to
> the topic of table compression, or alternatively, one can use a set
> of character encodings that, though not providing the most efficient
> compression, admit compact translation functions. On the other
> hand, a translation table of just a hundred thousand characters is
> not so big anymore on today's computers.
> And one can go further, doing a statistical analysis of typical text
> in the different languages, identifying words and their typical
> frequencies. A compression would then identify common words that are
> worth compressing, and give each of them one entry in the
> translation table.
> Hans Aberg
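
The word-table idea in the last quoted paragraph could look something like the sketch below. It is a toy illustration under assumptions: common words are mapped to single code points in the BMP Private Use Area (U+E000 to U+F8FF), so the input must not already contain such code points, and the names `build_table`, `substitute` and `restore` are hypothetical. A general-purpose compressor could then be run on the substituted text.

```python
import re
from collections import Counter

PUA_START = 0xE000  # first BMP private-use code point
PUA_END = 0xF8FF    # last BMP private-use code point


def build_table(sample: str, size: int = 256) -> list[str]:
    """Pick the most frequent words in `sample` as table entries."""
    size = min(size, PUA_END - PUA_START + 1)
    # One-character words are skipped: replacing them saves nothing.
    counts = Counter(w for w in re.findall(r"\w+", sample) if len(w) > 1)
    return [word for word, _ in counts.most_common(size)]


def substitute(text: str, table: list[str]) -> str:
    """Replace each table word by its single private-use code point."""
    if any(PUA_START <= ord(ch) <= PUA_END for ch in text):
        raise ValueError("input already uses private-use code points")
    index = {word: chr(PUA_START + i) for i, word in enumerate(table)}
    return re.sub(r"\w+", lambda m: index.get(m.group(), m.group()), text)


def restore(text: str, table: list[str]) -> str:
    """Expand private-use code points back into the words they stand for."""
    limit = PUA_START + len(table)
    return "".join(
        table[ord(ch) - PUA_START] if PUA_START <= ord(ch) < limit else ch
        for ch in text
    )
```

The table itself has to be shipped with the data or agreed on in advance per language, which is the table-size cost discussed in the quoted message.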