Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 08:19:28 CDT

    On 21 Sep 2006, at 14:26, Doug Ewell wrote:

    >> One way to do character compression is simply to do a
    >> frequency analysis and sort the characters by it, which gives
    >> a map from code points to code points. Then apply to that a
    >> variable-width character encoding which gives smaller widths
    >> to smaller non-negative integers, such as UTF-8. Here, the
    >> compression method cannot do worse than UTF-8.
    >
    > You mean, do Huffman encoding, but with bytes as the basic code
    > unit instead of bits?

    It is similar to what I indicated; it is just a general scheme
    that came to mind. :-)

    One can think of different methods, depending on what properties
    one wants the encoded text to have. UTF-like encodings have the
    advantages that they fall on byte boundaries and that multibyte
    sequences can always be resynchronized, but they give less
    compression than a bit-level scheme. One can instead map code
    points to bit sequences, Huffman-style, to get better
    compression.
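
    To make the rank-then-encode idea concrete, a minimal sketch in
    Python (the names are made up, the variable-width form is merely
    UTF-8-style, and serializing the rank table is left out):

        from collections import Counter

        def encode_varwidth(n):
            # UTF-8-style variable-width encoding of a non-negative
            # integer: smaller values get fewer bytes. The 4-byte
            # form here covers values up to 2**24 - 1, slightly past
            # the range of real UTF-8.
            if n < 0x80:
                return bytes([n])
            if n < 0x800:
                return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
            if n < 0x10000:
                return bytes([0xE0 | (n >> 12),
                              0x80 | ((n >> 6) & 0x3F),
                              0x80 | (n & 0x3F)])
            return bytes([0xF0 | (n >> 18),
                          0x80 | ((n >> 12) & 0x3F),
                          0x80 | ((n >> 6) & 0x3F),
                          0x80 | (n & 0x3F)])

        def compress(text):
            # Number the code points by frequency: the most common
            # character gets 0, the next 1, and so on. Encoding the
            # ranks instead of the code points then never takes more
            # bytes than UTF-8 on the same text (table size ignored).
            ranks = {ch: i for i, (ch, _) in
                     enumerate(Counter(text).most_common())}
            return b"".join(encode_varwidth(ranks[ch]) for ch in text)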

    Another method, which enables compressing both characters (code
    points) and natural-language words (sequences of code points),
    might be to make a modified UTF-8, where the leading byte can
    indicate two categories of numbers. (Continued below.)

    > Don't forget you need to store the frequency table along with the
    > compressed data, so the reader can reconstruct the table. That
    > could mitigate your compression somewhat.

    I am basically ignoring the problem of the table size. In some
    methods, the table size will be crucial, of course.

    But one might do a statistical analysis of the common
    natural-language words in a large text corpus and, based on
    that, give the more common words lower numbers. This results in
    a fixed translation table, suitable for general-purpose texts.
    Then one applies the hybrid encoding above. As the word
    translation table will be the same for different compressed
    texts, it will not matter that it is large: it can be shipped as
    a library.
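
    To illustrate the hybrid idea, a hypothetical sketch reusing
    encode_varwidth from the sketch above. Instead of reserving
    distinct leading bytes, the low bit of each number carries the
    category, which is a simplification of the leading-byte idea;
    the tiny WORDS table merely stands in for the fixed,
    library-supplied table, and splitting the text into words and
    residual characters is assumed to happen upstream:

        WORDS = ["the", "and", "that", "encoding"]  # stand-in table
        WORD_INDEX = {w: i for i, w in enumerate(WORDS)}

        def encode_token(token):
            # Category 1 (low bit set): the whole token is in the
            # word table, so emit its index. Category 0: fall back
            # to character-by-character encoding. Tagging doubles
            # each number, so a rare character may cost an extra
            # byte, while a common word drops to one or two bytes.
            if token in WORD_INDEX:
                return encode_varwidth((WORD_INDEX[token] << 1) | 1)
            return b"".join(encode_varwidth(ord(ch) << 1)
                            for ch in token)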

       Hans Aberg


