Re: Compression and Unicode [was: Name Compression]

From: Mark Davis (markdavis@ispchannel.com)
Date: Thu May 11 2000 - 01:56:08 EDT


SCSU has a different design point. It is specifically architected to work well for small, independent pieces of text. You can use it, for example, to encode individual fields in a database without sacrificing random access to those fields.
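
To make the random-access point concrete, below is a rough sketch of SCSU's single-byte mode in Python. It is deliberately not a conforming encoder: it uses only the initial dynamic window (U+0080..U+00FF) and quotes everything else, omitting window switching, Unicode mode, and supplementary characters.

    SQU = 0x0E            # SCSU tag: quote the next character as two raw bytes
    WINDOW_OFFSET = 0x80  # initial active dynamic window: U+0080..U+00FF

    def scsu_encode(text):
        """Simplified SCSU single-byte mode; BMP characters only."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:
                raise ValueError("sketch handles BMP characters only")
            if cp in (0x00, 0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7F:
                out.append(cp)                         # ASCII passes through
            elif WINDOW_OFFSET <= cp < WINDOW_OFFSET + 0x80:
                out.append(cp - WINDOW_OFFSET + 0x80)  # active-window byte
            else:
                out += bytes((SQU, cp >> 8, cp & 0xFF))  # quote one character
        return bytes(out)

Because every field starts from SCSU's well-defined initial state, each field can be decoded on its own; no shared dictionary or carried-over state gets in the way of random access.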

I suspect those who evolved mechanisms for compressing Unicode names were working under similar constraints. The goal would be to minimize memory requirements (not just disk) and maintain fast random access, yet keep the code simple and maintainable. If the patent-free variants of LZW can accomplish this, then I'm sure a lot of people would be eager to get more details about those variants from you.
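
For illustration, a word-indexed name table along the lines Torsten describes below might be structured like this. The word list, code ranges, and separator handling here are my assumptions for the sketch, not SC UniPad's actual format, and it assumes fewer than 32K distinct words.

    def build_table(names):
        # A real table would order words by frequency so the most common
        # ones land in the one-byte code range; plain sorting is a stand-in.
        words = sorted({w for n in names for w in n.split()})
        code = {w: i for i, w in enumerate(words)}
        blob, offsets = bytearray(), []
        for name in names:
            offsets.append(len(blob))        # per-name offset: random access
            for w in name.split():           # separators simplified to spaces
                i = code[w]
                if i < 0x80:
                    blob.append(i)           # one-byte code for early words
                else:
                    blob += bytes((0x80 | (i >> 8), i & 0xFF))  # two-byte code
        return words, bytes(blob), offsets

    def decode_name(words, blob, offsets, k):
        """Decode the k-th name directly, without touching the others."""
        start = offsets[k]
        end = offsets[k + 1] if k + 1 < len(offsets) else len(blob)
        out, i = [], start
        while i < end:
            b = blob[i]
            if b < 0x80:
                out.append(words[b]); i += 1
            else:
                out.append(words[((b & 0x7F) << 8) | blob[i + 1]]); i += 2
        return " ".join(out)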

[BTW, it was also our experience that with larger files, compressing with SCSU then compressing with LZW produced better compression than LZW alone.]
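
That effect is easy to approximate. For printable Latin-1 text, SCSU output is byte-for-byte the ISO 8859-1 encoding, so the sketch below uses Python's latin-1 codec as a stand-in for SCSU, and zlib's DEFLATE as a stand-in for LZW (the standard library has no LZW codec); the sample text is invented.

    import zlib

    # Invented sample; any sizeable Latin-1 text will do.
    text = "Ein größeres Stück deutschen Texts, tausendfach wiederholt. " * 1000

    utf16 = text.encode("utf-16-be")    # raw UTF-16 code units
    scsu_ish = text.encode("latin-1")   # what SCSU emits for this text

    print("UTF-16, deflated:       ", len(zlib.compress(utf16)))
    print("SCSU-ish, then deflated:", len(zlib.compress(scsu_ish)))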

Mark

Juliusz Chroboczek wrote:

> mohrin@sharmahd.com (Torsten Mohrin) writes:
>
> TM> In SC UniPad we use a compressed name table. The names are compressed
> TM> by encoding the words either in one or two bytes. The separators
> TM> (space and hyphen-minus) are encoded in a special way. It works as
> TM> follows:
>
> [explanation snipped]
>
> Why not use Huffman encoding? You could precompute the Huffman tables
> once and for all, compile them into your program, and only do the
> actual encoding/decoding at runtime.
>
> It would be a little bit more computationally expensive than your
> scheme due to the need to access parts of bytes, but would yield a
> much better compression ratio.
>
> More generally, I get the impression that the Unicode community is
> particularly keen on inventing /ad hoc/ compression schemes. I still
> haven't heard a sound rationale for the existence of SCSU. What's
> wrong with patent-free variants of LZW?
>
> J.
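
As a footnote to Juliusz's suggestion above: precomputed Huffman tables could look roughly like the sketch below. The alphabet and frequencies are invented; a real build would measure frequencies over the actual name data offline and compile the resulting table into the program, packing real bits rather than using strings.

    import heapq

    # Invented toy frequencies; real ones would come from the name data.
    FREQ = {" ": 180, "A": 90, "E": 70, "I": 60, "L": 55, "N": 50, "O": 45,
            "P": 42, "R": 40, "T": 38, "-": 30, "C": 25, "S": 20}

    def huffman_code(freq):
        # Standard heap-based Huffman construction; ties broken by counter.
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + b for s, b in c1.items()}
            merged.update({s: "1" + b for s, b in c2.items()})
            heapq.heappush(heap, (f1 + f2, n, merged))
            n += 1
        return heap[0][2]

    CODE = huffman_code(FREQ)          # computed once, compiled in
    DECODE = {bits: s for s, bits in CODE.items()}

    def encode(name):
        # Bit string for clarity; production code would pack the bits.
        return "".join(CODE[c] for c in name)

    def decode(bits):
        out, cur = [], ""
        for b in bits:
            cur += b
            if cur in DECODE:          # prefix property makes greedy decoding safe
                out.append(DECODE[cur]); cur = ""
        return "".join(out)

    print(decode(encode("LATIN CAPITAL LETTER A")))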


