# Re: Unicode & space in programming & l10n

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Sep 21 2006 - 01:13:36 CDT

On 9/20/2006 10:01 PM, Doug Ewell wrote:
> Hans Aberg <haberg at math dot su dot se> wrote:
>
>> Relative to that stuff, I suggest to compress the character data, as
>> represented by the code points, rather than any character-encoded data.
>> Typically, a compression method builds a binary encoding based on a
>> statistical analysis of a sequence of data units. So if applied to
>> the character data, there results a character encoding from such a
>> compression. Conversely, any character encoding can be viewed as a
>> compression method with certain statistical properties.
>
> Different compression methods work in different ways. Certainly, a
> compression method that is specifically designed for Unicode text can
> take advantage of the unique properties of Unicode text, as compared
> to, say, photographic images.
>
> I've often suspected that a Huffman or arithmetic encoder that encoded
> Unicode code points directly would perform better than a byte-based
> one working with UTF-8 code units. I haven't done the math to prove
> it, though.

You missed attending the very *first* Unicode Implementers Workshop,
where one of the presentations did precisely that kind of math. If you
assume a large alphabet, your compression gets worse, even if the
number of elements actually used is small. SCSU and similar algorithms
reduce the effective alphabet to about 127, which is much less than
0x10FFFF+1.

There is information in *where* in a large alphabet your subset resides.
If you don't model that well, or don't use a scheme that happens to model
this situation well, you pay a price of a few fractional bits per
character, almost 1 bit if you use an 8-bit Huffman on 16-bit character
data.
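
As a rough illustration of the comparison under discussion, here is a minimal Python sketch that measures the order-0 Shannon entropy of the same text twice: once with code points as the symbols, and once with UTF-8 bytes as the symbols, scaled to bits per character. The function names `entropy_bits` and `bits_per_character` are made up for this example, and the workshop presentation may have done its analysis quite differently. The sketch only gives the ideal coding cost once the symbol statistics are known. It deliberately ignores the cost of describing which part of the large alphabet is in use, which is the price referred to above.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Order-0 Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def bits_per_character(text):
    """Return (code-point model cost, UTF-8 byte model cost), both in
    bits per character. Model-description overhead is NOT included."""
    code_point_cost = entropy_bits(text)       # symbols are code points
    utf8 = text.encode("utf-8")                # symbols are UTF-8 code units
    byte_cost = entropy_bits(utf8) * len(utf8) / len(text)
    return code_point_cost, byte_cost


if __name__ == "__main__":
    for sample in ("abab", "да"):
        cp, by = bits_per_character(sample)
        print(f"{sample!r}: code points {cp:.2f}, UTF-8 bytes {by:.2f} bits/char")
```

For pure ASCII text the two figures coincide, since every character is a single byte. For the two-letter Cyrillic sample the byte model pays for the repeated lead byte as if it were independent of the trail byte, so its per-character cost is higher. A real encoder over code points would still have to transmit its model, and that overhead grows with the assumed alphabet size.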
>
>> When compressing character encoded data, one first translates it into
>> character data, and compresses that. So it does then not matter which
>> character encoding originally is used in the input, as the character
>> data will be the same: the final compression need only to include the
>> to restore data.
>
> Actually, it does matter for some compression methods, such as the
> well-known LZW. Burrows-Wheeler is fairly unusual in this regard.
>
> --
> Doug Ewell
> Fullerton, California, USA
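
The quoted remark that the input encoding matters for LZW can be checked with a small experiment. The following is a minimal sketch of a textbook byte-oriented LZW encoder, with an unbounded dictionary and no code-width handling, that only counts how many codes it would emit. The name `lzw_code_count` is hypothetical, and production LZW variants (GIF, Unix compress) differ in details such as dictionary resets.

```python
def lzw_code_count(data: bytes) -> int:
    """Count the codes a textbook LZW encoder emits for `data`.

    Assumptions for this sketch: the dictionary starts with all 256
    single-byte strings, grows without limit, and is never reset.
    """
    table = {bytes([i]): i for i in range(256)}
    phrase = b""
    count = 0
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in table:
            phrase = candidate            # keep extending the current match
        else:
            count += 1                    # emit the code for `phrase`
            table[candidate] = len(table) # learn the new, longer string
            phrase = bytes([value])
    return count + (1 if phrase else 0)   # flush the final phrase


if __name__ == "__main__":
    text = "aaaa"
    for encoding in ("utf-8", "utf-16-le"):
        print(encoding, lzw_code_count(text.encode(encoding)))
```

The same four characters produce a different number of codes depending on the byte encoding fed to the compressor, because the dictionary is built over bytes rather than characters. Code counts are only a proxy for compressed size, since real encoders also vary the bit width of each code.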