RE: Compression and Unicode [was: Name Compression]

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat May 13 2000 - 18:29:46 EDT


At 11:10 AM 5/12/00 +0100, Marco.Cimarosti@icl.com wrote:
>Asmus Freytag wrote:
> > I wonder whether he [Torsten] has measured the difference
> > in size from the effect of dedicating one bit to the
> > cause of SPACE vs HYPHEN. [...]
> > Hard to know how many words really show up in both forms.
>
>OK: I did the counting, hoping it is worth something. There are 76 such
>cases (see list in l_hs.txt). I also counted words that can be followed by
>"-" or occur at end of name: 68 cases, many of which are shared with
>previous list (see list in l_he.txt).
>
>I noticed that Torsten's scheme assumes that " " and "-" are mutually
>exclusive separators, but this is not true for a handful of Tibetan
>characters that have sequences like " -" or "- " (see list in l_xx.txt). How
>are these cases handled?

One wonders. (probably the -XXX is treated as a regular word starting with -).
The small number make assigning the extra bit seem unnecessary and expensive,
a better scheme would be to treat all words that end in - as not needing a
space following (and to add the "XXX- " cases with a space in the table.

Since this doubles the number of words that need only single bytes, it should
reduce the overall file considerably, even if the word table grows by a few
hundred bytes.

> > His word table could be further compressed.
>
>Do you refer to the fact that many words are identical to the trailing part
>of longer words?
>
>Unluckily, this cannot be exploited if the names are encoded as sequences of
>indexed to *words*, as in Torsten's scheme.

No, you would need a separate map from key to word entry, that's all.

>To take advantage of this feature of the data, the word indexes should be
>substituted by *character* indexes to the beginning of the name.
>
>This, however, would screw up Torsten's approach to assign the shortest
>possible ids to the most common words, in order to save space.
>
> > PS:) coders will be coders, they like to invent new coding schemes
>
>Yes, why using the same old standard wheels, when you can have such a good
>time reinventing your own model?

The techniques he's using, and the ones we are discussing here are standard
techniques in compressing fixed dictionaries, and such. Compression always
depends on a model (a shared assumption on what kinds of text one can
compress and how to exploit that) and then an arithmetic coding of the
elements under the model. There's nothing really wrong about being clever
about the model and the representation if the goals are different from the
standard tools (in this case his desire for faster scanning and access to
substrings).

A./



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT