Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 08:19:28 CDT

    On 21 Sep 2006, at 14:26, Doug Ewell wrote:

    >> One way to do character compression is simply to do a
    >> frequency analysis and sort the characters by it, which gives
    >> a map from code points to code points. Then apply to that a
    >> variable-width character encoding which gives smaller widths
    >> to smaller non-negative integers, such as UTF-8. Here, the
    >> compression method cannot do worse than UTF-8.
    >
    > You mean, do Huffman encoding, but with bytes as the basic code
    > unit instead of bits?

    It is similar to what I indicated; it is just a general scheme
    that came to mind. :-)

    One can think of different methods, depending on what properties
    one wants the encoded text to have. UTF-like encodings have the
    advantages that they fall on byte boundaries and that multibyte
    sequences can always be resynchronized, but they give less
    compression than a bit-level scheme. One can instead map code
    points to bit sequences, Huffman-style, to get better
    compression.
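
    To make the rank-then-encode idea concrete, a minimal sketch in
    Python (the names are made up, the variable-width form is merely
    UTF-8-style, and serializing the rank table is left out):

        from collections import Counter

        def encode_varwidth(n):
            # UTF-8-style variable-width encoding of a non-negative
            # integer: smaller values get fewer bytes. The 4-byte
            # form here covers values up to 2**24 - 1, slightly past
            # the range of real UTF-8.
            if n < 0x80:
                return bytes([n])
            if n < 0x800:
                return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
            if n < 0x10000:
                return bytes([0xE0 | (n >> 12),
                              0x80 | ((n >> 6) & 0x3F),
                              0x80 | (n & 0x3F)])
            return bytes([0xF0 | (n >> 18),
                          0x80 | ((n >> 12) & 0x3F),
                          0x80 | ((n >> 6) & 0x3F),
                          0x80 | (n & 0x3F)])

        def compress(text):
            # Number the code points by frequency: the most common
            # character gets 0, the next 1, and so on. Encoding the
            # ranks instead of the code points then never takes more
            # bytes than UTF-8 on the same text (table size ignored).
            ranks = {ch: i for i, (ch, _) in
                     enumerate(Counter(text).most_common())}
            return b"".join(encode_varwidth(ranks[ch]) for ch in text)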

    Another method, which enables compressing both characters (code
    points) and natural-language words (sequences of code points),
    might be to make a modified UTF-8, where the leading byte can
    indicate two categories of numbers. (Continued below.)

    > Don't forget you need to store the frequency table along with the
    > compressed data, so the reader can reconstruct the table. That
    > could mitigate your compression somewhat.

    I am basically ignoring the problem of the table size. In some
    methods, the table size will be crucial, of course.

    But one might do a statistical analysis of the common
    natural-language words in a large text corpus and, based on
    that, give the more common words lower numbers. This results in
    a fixed translation table, suitable for general-purpose texts.
    Then one applies the hybrid encoding above. As the word
    translation table will be the same for different compressed
    texts, it will not matter that it is large: it can be shipped as
    a library.
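
    To illustrate the hybrid idea, a hypothetical sketch reusing
    encode_varwidth from the sketch above. Instead of reserving
    distinct leading bytes, the low bit of each number carries the
    category, which is a simplification of the leading-byte idea;
    the tiny WORDS table merely stands in for the fixed,
    library-supplied table, and splitting the text into words and
    residual characters is assumed to happen upstream:

        WORDS = ["the", "and", "that", "encoding"]  # stand-in table
        WORD_INDEX = {w: i for i, w in enumerate(WORDS)}

        def encode_token(token):
            # Category 1 (low bit set): the whole token is in the
            # word table, so emit its index. Category 0: fall back
            # to character-by-character encoding. Tagging doubles
            # each number, so a rare character may cost an extra
            # byte, while a common word drops to one or two bytes.
            if token in WORD_INDEX:
                return encode_varwidth((WORD_INDEX[token] << 1) | 1)
            return b"".join(encode_varwidth(ord(ch) << 1)
                            for ch in token)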

       Hans Aberg


