Re: Unicode & space in programming & l10n

From: Asmus Freytag (
Date: Thu Sep 21 2006 - 01:13:36 CDT


    On 9/20/2006 10:01 PM, Doug Ewell wrote:
    > Hans Aberg <haberg at math dot su dot se> wrote:
    >> Relative to that stuff, I suggest compressing the character data, as
    >> represented by the code points, rather than any character-encoded data.
    >> Typically, a compression method builds a binary encoding based on a
    >> statistical analysis of a sequence of data units. So if it is applied to
    >> the character data, the compression itself yields a character encoding.
    >> Conversely, any character encoding can be viewed as a
    >> compression method with certain statistical properties.
    > Different compression methods work in different ways. Certainly, a
    > compression method that is specifically designed for Unicode text can
    > take advantage of the unique properties of Unicode text, as compared
    > to, say, photographic images.
    > I've often suspected that a Huffman or arithmetic encoder that encoded
    > Unicode code points directly would perform better than a byte-based
    > one working with UTF-8 code units. I haven't done the math to prove
    > it, though.
    You missed attending the very *first* Unicode Implementers Workshop,
    where one of the presentations did precisely that kind of math. If you
    assume a large alphabet, then your compression gets worse, even if the
    number of elements actually used is small. SCSU and similar algorithms reduce
    the effective alphabet to +-127, which is much less than 0x10FFFF+1.
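
A minimal sketch of the alphabet-size effect, under assumptions that are not from the original post: an adaptive order-0 model with add-one smoothing stands in for a real coder, the sample string is arbitrary, and `to_window` is a hypothetical remapping only loosely inspired by SCSU's windows (it is not SCSU).

```python
import math
from collections import Counter

UNICODE_ALPHABET = 0x10FFFF + 1  # every code point is a possible symbol
BYTE_ALPHABET = 256              # symbols are 8-bit units

def adaptive_bits(symbols, alphabet_size):
    """Bits spent by an adaptive order-0 model with add-one smoothing
    that assumes `alphabet_size` possible symbols."""
    counts = Counter()
    seen = 0
    total = 0.0
    for s in symbols:
        # Probability the model assigns to s before it has seen s this time.
        p = (counts[s] + 1) / (seen + alphabet_size)
        total -= math.log2(p)
        counts[s] += 1
        seen += 1
    return total

def to_window(text, base=0x0400):
    """Hypothetical remapping: ASCII stays 0..127, and the 128 code points
    starting at `base` become 128..255. Anything else is rejected."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif base <= cp < base + 0x80:
            out.append(0x80 + cp - base)
        else:
            raise ValueError(f"U+{cp:04X} is outside the window")
    return out

sample = "Привет, мир! Это небольшой пример текста на русском языке."
n = len(sample)

bits_code_points = adaptive_bits([ord(c) for c in sample], UNICODE_ALPHABET)
bits_utf8 = adaptive_bits(sample.encode("utf-8"), BYTE_ALPHABET)
bits_window = adaptive_bits(to_window(sample), BYTE_ALPHABET)

print(f"code points, full Unicode alphabet: {bits_code_points / n:.2f} bits/char")
print(f"UTF-8 code units, 256 symbols:      {bits_utf8 / n:.2f} bits/char")
print(f"windowed symbols, 256 symbols:      {bits_window / n:.2f} bits/char")
```

On a short text like this, the model that treats all of Unicode as its alphabet spends far more bits per character than either small-alphabet variant, even though only a few dozen distinct characters occur. A better prior or a longer text would narrow the gap; the sketch only illustrates the direction of the effect.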

    There is information in *where* in a large alphabet your subset resides.
    If you don't model that well, or don't use a scheme that happens to model this
    situation well, you pay a price of a few fractional bits per character -
    almost 1 if you use an 8-bit Huffman on 16-bit character data.
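
One way to put a rough number on the "where" information, offered as an illustration and not as the math from the workshop presentation: if a text uses `used` distinct symbols out of an alphabet of `alphabet_size`, naming that subset with no prior knowledge takes log2 of the number of possible subsets. The function names, the 60-symbol repertoire, and the 1000-character text length are all assumptions.

```python
import math

def subset_bits(alphabet_size, used):
    """Bits needed to state which `used` symbols, out of `alphabet_size`
    possible ones, actually occur (every subset taken as equally likely)."""
    return math.log2(math.comb(alphabet_size, used))

def overhead_per_char(alphabet_size, used, text_length):
    """The subset cost spread evenly over a text of `text_length` characters."""
    return subset_bits(alphabet_size, used) / text_length

# Assumed scenario: 60 distinct characters in a 1000-character text.
for name, size in (("8-bit alphabet", 256),
                   ("16-bit alphabet", 0x10000),
                   ("all of Unicode", 0x10FFFF + 1)):
    cost = overhead_per_char(size, used=60, text_length=1000)
    print(f"{name:16s} {cost:.3f} bits/char just to locate the subset")
```

The cost grows with the size of the assumed alphabet even though the number of symbols in use stays fixed, which is the effect a window-based scheme avoids by stating the location of the subset cheaply.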
    >> When compressing character-encoded data, one first translates it into
    >> character data, and compresses that. It then does not matter which
    >> character encoding was originally used in the input, as the character
    >> data will be the same: the final compression need only include the
    >> additional information about what the original character encoding was,
    >> in order to restore the data.
    > Actually, it does matter for some compression methods, such as the
    > well-known LZW. Burrows-Wheeler is fairly unusual in this regard.
    > --
    > Doug Ewell
    > Fullerton, California, USA
    > RFC 4645 * UTN #14
