Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 06:45:53 CDT


    On 21 Sep 2006, at 08:13, Asmus Freytag wrote:

    > If you assume a large alphabet, then your compression gets worse,
    > even if the actual number of elements are few.

    Why would that be? In one compression method, one simply performs a
    frequency analysis on the characters used and encodes based on that,
    so table entries are needed only for characters actually used.

    One way to do character compression is to perform a frequency
    analysis and sort the characters by frequency, which gives a map
    code points -> code points. Then apply to the remapped values a
    variable-width character encoding that assigns smaller widths to
    smaller non-negative integers, such as UTF-8. This compression
    method cannot do worse than UTF-8, since the most frequent
    characters receive the smallest code points and hence the shortest
    encodings.
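
    The scheme above can be sketched as follows. This is my own
    illustration, not code from the thread; it assumes the number of
    distinct characters stays below the surrogate range (U+D800), so
    every remapped code point is itself UTF-8-encodable.

    ```python
    from collections import Counter

    def remap_by_frequency(text):
        # Rank distinct characters by descending frequency; the most
        # frequent character receives the smallest new code point (0).
        ranked = [ch for ch, _ in Counter(text).most_common()]
        return {ch: i for i, ch in enumerate(ranked)}

    def compress(text):
        # Apply the frequency-based remapping, then use UTF-8 as the
        # variable-width encoding of the remapped code points.
        mapping = remap_by_frequency(text)
        remapped = "".join(chr(mapping[ch]) for ch in text)
        return mapping, remapped.encode("utf-8")
    ```

    With, say, "abracadabra", the five distinct characters remap to
    code points 0-4, all of which UTF-8 encodes in a single byte, so
    the output is never longer than the UTF-8 encoding of the original
    (the identity map would reproduce plain UTF-8 exactly).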

       Hans Aberg



    This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 06:48:26 CDT