Re: Unicode & space in programming & l10n

From: Mark Davis (mark.davis@icu-project.org)
Date: Wed Sep 20 2006 - 16:25:53 CDT


    I strongly suspect that all of that would give only minor advantages over
    general-purpose algorithms like ZIP. But this is all academic -- I don't see
    anyone taking the time and effort to investigate it in the absence of a
    compelling need.

    Mark

    On 9/20/06, Hans Aberg <haberg@math.su.se> wrote:
    >
    >
    > On 20 Sep 2006, at 04:14, Doug Ewell wrote:
    >
    > > Hans Aberg <haberg at math dot su dot se> wrote:
    > >
    > >> It is probably more efficient to translate the stream into code
    > >> points and then use a compression technique on that, because then
    > >> the full character structure is taken into account. Then it does
    > >> not matter which character encoding is used.
    > >
    > > If you have not yet read Unicode Technical Note #14, particularly
    > > the sections on "general-purpose compression" and "two-layer
    > > compression," you might wish to do so.
    >
    > Relative to that stuff, I suggest compressing the character data, as
    > represented by the code points, rather than any character-encoded
    > data. Typically, a compression method builds a binary encoding based
    > on a statistical analysis of a sequence of data units. So when applied
    > to the character data, such a compression yields a character encoding.
    > Conversely, any character encoding can be viewed as a compression
    > method with certain statistical properties.
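    >
    > For instance, here is a rough sketch in Python, taking Huffman coding
    > as the statistical method (one possible choice among many; the
    > function name is mine):
    >
    >     import heapq
    >     from collections import Counter
    >
    >     def huffman_code(text):
    >         # Statistical analysis of the sequence of data units,
    >         # here the code points of the text.
    >         freq = Counter(text)
    >         # Degenerate case: a single distinct code point.
    >         if len(freq) == 1:
    >             return {next(iter(freq)): "0"}
    >         # Build the Huffman tree; the counter i breaks ties so
    >         # the heap never has to compare trees directly.
    >         heap = [(w, i, cp) for i, (cp, w) in enumerate(freq.items())]
    >         heapq.heapify(heap)
    >         i = len(heap)
    >         while len(heap) > 1:
    >             w1, _, a = heapq.heappop(heap)
    >             w2, _, b = heapq.heappop(heap)
    >             heapq.heappush(heap, (w1 + w2, i, (a, b)))
    >             i += 1
    >         # Walk the tree: the result is a binary encoding of the
    >         # characters derived purely from the input statistics.
    >         codes = {}
    >         def walk(node, prefix):
    >             if isinstance(node, tuple):
    >                 walk(node[0], prefix + "0")
    >                 walk(node[1], prefix + "1")
    >             else:
    >                 codes[node] = prefix
    >         walk(heap[0][2], "")
    >         return codes
    >
    > For example, huffman_code("abracadabra") assigns the frequent "a" a
    > shorter bit string than the rare "c" or "d".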
    >
    > When compressing character-encoded data, one first translates it into
    > character data and compresses that. It then does not matter which
    > character encoding was originally used in the input, as the character
    > data will be the same: the compressed output need only include, as
    > additional information, the original character encoding, so that the
    > data can be restored.
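    >
    > A rough sketch of this in Python, assuming zlib as the general-purpose
    > compressor and UTF-32 as the uniform serialization of the code points
    > (both arbitrary choices for illustration, as are the function names):
    >
    >     import zlib
    >
    >     def compress_text(raw, encoding):
    >         # Translate the encoded input into character data.
    >         text = raw.decode(encoding)
    >         # Compress one fixed serialization of the code points,
    >         # so the result does not depend on the input encoding.
    >         payload = zlib.compress(text.encode("utf-32-le"))
    >         # Record the original encoding to restore the input.
    >         return encoding.encode("ascii") + b"\x00" + payload
    >
    >     def restore_text(blob):
    >         name, _, payload = blob.partition(b"\x00")
    >         text = zlib.decompress(payload).decode("utf-32-le")
    >         return text.encode(name.decode("ascii"))
    >
    > Two inputs carrying the same text in, say, Latin-1 and UTF-8 then
    > compress to the same payload, differing only in the recorded name.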
    >
    > There is the problem of large translation tables. But that belongs to
    > the chapter of table compression; alternatively, one can use a set
    > of character encodings that, though not providing the most efficient
    > compression, admit compact translation functions. On the other
    > hand, a translation table of just a hundred thousand characters is
    > not so big anymore on today's computers.
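    >
    > A sketch in Python of one such compact scheme (my illustration): the
    > translation table is restricted to the code points that actually
    > occur in the input, so it stays small and can travel with the
    > compressed data:
    >
    >     def build_table(text):
    >         # The table need only cover the code points that occur,
    >         # which is usually far fewer than the full repertoire.
    >         inverse = sorted(set(text))
    >         table = {cp: i for i, cp in enumerate(inverse)}
    >         # Translate the characters into dense indices, which a
    >         # compressor can then encode; inverse restores them.
    >         indices = [table[cp] for cp in text]
    >         return indices, inverse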
    >
    > And one can go further, doing a statistical analysis of typical text
    > in the different languages, identifying words and their typical
    > frequencies. A compression would then identify common words, suitable
    > for compression, and give each of them one entry in the
    > translation table.
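    >
    > A rough sketch of that step in Python (the word pattern and the
    > cutoff are arbitrary illustrations):
    >
    >     import re
    >     from collections import Counter
    >
    >     def word_entries(text, max_words=1000):
    >         # Identify common words and give each one its own entry,
    >         # numbered after the code points already in the table.
    >         words = re.findall(r"\w+", text)
    >         common = [w for w, n in Counter(words).most_common(max_words)
    >                   if n > 1]
    >         base = len(set(text))
    >         return {w: base + i for i, w in enumerate(common)}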
    >
    > Hans Aberg
    >



    This archive was generated by hypermail 2.1.5 : Wed Sep 20 2006 - 16:34:01 CDT