Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Wed Sep 20 2006 - 12:53:14 CDT

    On 20 Sep 2006, at 04:14, Doug Ewell wrote:

    > Hans Aberg <haberg at math dot su dot se> wrote:
    >
    >> It is probably more efficient to translate the stream into code
    >> points and then use a compression technique on that, because then
    >> the full character structure is taken into account. Then it does
    >> not matter which character encoding is used.
    >
    > If you have not yet read Unicode Technical Note #14, particularly
    > the sections on "general-purpose compression" and "two-layer
    > compression," you might wish to do so.

    Relative to that, I suggest compressing the character data, as
    represented by the code points, rather than any character-encoded
    data. Typically, a compression method builds a binary encoding based
    on a statistical analysis of a sequence of data units. So if it is
    applied to the character data, the compression itself yields a
    character encoding. Conversely, any character encoding can be viewed
    as a compression method with certain statistical properties.
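
    To make the point concrete, here is a minimal sketch that treats
    code points as the data units and derives a binary code from their
    frequencies. Huffman coding is only my choice of illustration, and
    the function names are made up for this example; any statistical
    coder would make the same point.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text: str) -> dict[str, str]:
    """Map each code point in `text` to a bit string, shorter for frequent ones."""
    freq = Counter(text)  # Python strings iterate by code point
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the heap never compares the dicts
    heap = [(n, next(tie), {ch: ""}) for ch, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {ch: "0" + bits for ch, bits in c1.items()}
        merged.update({ch: "1" + bits for ch, bits in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(text: str, code: dict[str, str]) -> str:
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict[str, str]) -> str:
    inverse = {b: ch for ch, b in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # the code is prefix-free, so this is unambiguous
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

    The resulting table is in effect a variable-length character
    encoding tuned to the statistics of the input, which is the sense
    in which a compression of character data yields a character
    encoding.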

    When compressing character-encoded data, one first translates it into
    character data and compresses that. It then does not matter which
    character encoding was originally used in the input, as the character
    data will be the same: the final compressed output need only include
    the additional information about what the original character
    encoding was, in order to restore the data.
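
    A rough sketch of that scheme follows. It assumes Python's codec
    names as the label for the original encoding, uses zlib merely as
    a stand-in for whatever compressor is applied to the code points,
    and serializes the code points as UTF-32 for convenience; the
    container layout and the names pack, unpack and payload are
    invented for the example.

```python
import zlib


def pack(data: bytes, encoding: str) -> bytes:
    """Decode to code points, compress those, and record the original encoding."""
    text = data.decode(encoding)
    code_points = text.encode("utf-32-le")  # fixed-width serialization of code points
    name = encoding.encode("ascii")
    if len(name) > 255:
        raise ValueError("encoding name too long for the one-byte length field")
    # Layout (hypothetical): 1 length byte, encoding name, compressed code points.
    return bytes([len(name)]) + name + zlib.compress(code_points)


def payload(blob: bytes) -> bytes:
    """Return only the compressed code points, without the encoding label."""
    return blob[1 + blob[0]:]


def unpack(blob: bytes) -> bytes:
    """Restore the data in its original character encoding."""
    n = blob[0]
    encoding = blob[1:1 + n].decode("ascii")
    text = zlib.decompress(blob[1 + n:]).decode("utf-32-le")
    return text.encode(encoding)
```

    The same text packed from Latin-1 and from UTF-16 gives identical
    compressed payloads; only the label differs. Note that the restore
    step is byte-exact only when the input was the canonical encoding
    of its text.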

    There is the problem of large translation tables. But that belongs to
    the chapter of table compression; alternatively, one can use a set
    of character encodings that, though not providing the most efficient
    compression, admit compact translation functions. On the other
    hand, a translation table of just a hundred thousand characters is
    not so big anymore on today's computers.

    And one can go further, doing a statistical analysis of typical text
    in the different languages, identifying words and their typical
    frequencies. A compressor would then identify common words that are
    worth compressing and give each of them one entry in the
    translation table.
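
    As an illustration of the word-level idea, the sketch below builds
    a table from a sample corpus and replaces each common word by a
    single table entry, leaving everything else as individual code
    points. The regular-expression notion of a "word", the thresholds,
    and the function names are all assumptions of the example.

```python
import re
from collections import Counter


def build_table(corpus: str, max_words: int = 1000) -> list[str]:
    """List the most frequent words in the corpus, most common first."""
    words = re.findall(r"\w+", corpus)
    # Only words that repeat and are longer than one character earn an entry.
    return [w for w, n in Counter(words).most_common(max_words)
            if n > 1 and len(w) > 1]


def tokenize(text: str, table: list[str]) -> list[tuple[str, int]]:
    """Replace table words by ("word", index); keep the rest as ("char", code point)."""
    index = {w: i for i, w in enumerate(table)}
    tokens = []
    # Every character is matched: runs of word characters, or one non-word character.
    for piece in re.findall(r"\w+|\W", text):
        if piece in index:
            tokens.append(("word", index[piece]))
        else:
            tokens.extend(("char", ord(ch)) for ch in piece)
    return tokens


def detokenize(tokens: list[tuple[str, int]], table: list[str]) -> str:
    return "".join(table[v] if kind == "word" else chr(v) for kind, v in tokens)
```

    The token stream can then be fed to the same kind of statistical
    coder as before, with words and single characters sharing one
    alphabet.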

       Hans Aberg


