RE: UTN #31 and direct compression of code points

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 07 2007 - 02:53:17 CDT


    Doug Ewell wrote:
    >
    > I just discovered UTN #31, "Fast Compression Algorithm for Unicode
    > Text," which describes a simplified Lempel-Ziv-type approach that is
    > applied directly to "16 bit Unicode characters," unlike most LZ and
    > other compression algorithms which operate on bytes.

    In fact I am a bit puzzled by the comment on the second line of the sample
    code below:
        length = DecodeLength(&input);
        offset = DecodeOffset(&input); // same algorithm as DecodeLength

    This does not look correct, because only the length (of a match or of a
    literal) is encoded that way: bit 7 of the first byte (in *input)
    indicates whether it is a match or a literal, bit 6 of the first byte
    indicates the presence of further bits for the encoded length, and each
    following byte then contains a 1-bit continuation flag and 7 extra bits
    of the length value.

    For encoding the offset (in matches only), what is the use of bits 7 and 6?
    Couldn't we store up to 7 bits of the offset value (instead of 6 bits) in
    the same byte without requiring an extra byte?

    If so, the two functions DecodeLength() and DecodeOffset() need to be
    different.
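
    To make the difference concrete, here is a minimal hypothetical sketch
    (not taken from UTN #31, and assuming the bit layout just described) of
    what the two distinct decoders could look like; the isMatch
    out-parameter is an addition for illustration:

        #include <stdint.h>

        /* Length: bit 7 of the first byte is the match/literal flag,
           bit 6 is the continuation flag, bits 5..0 are the low 6 bits of
           the value; each following byte has a continuation flag in bit 7
           and 7 more payload bits (little-endian accumulation). */
        static uint32_t DecodeLength(const uint8_t **input, int *isMatch)
        {
            const uint8_t *p = *input;
            uint32_t value = *p & 0x3F;
            int shift = 6;
            *isMatch = (*p & 0x80) != 0;
            int more = (*p++ & 0x40) != 0;
            while (more) {
                value |= (uint32_t)(*p & 0x7F) << shift;
                shift += 7;
                more = (*p++ & 0x80) != 0;
            }
            *input = p;
            return value;
        }

        /* Offset: no match/literal flag is needed, so every byte can use
           bit 7 as the continuation flag and all 7 remaining bits as
           payload. */
        static uint32_t DecodeOffset(const uint8_t **input)
        {
            const uint8_t *p = *input;
            uint32_t value = 0;
            int shift = 0;
            int more;
            do {
                value |= (uint32_t)(*p & 0x7F) << shift;
                shift += 7;
                more = (*p++ & 0x80) != 0;
            } while (more);
            *input = p;
            return value;
        }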

    Also, figure 3 shows 0041 15 0C FD 78 for the output, but it would be
    better to show 41 00 15 0C FD 78, to make explicit that the first code
    unit is stored in little-endian byte order.

    However, I wonder if the choice of the fixed-size little-endian 16-bit
    format for the first character of a literal is appropriate. Why
    couldn't it be represented as a code-point difference, like the rest of
    the literal?
    ==> All you need is to define the value of the code point that precedes
    the stored literal in the original text, or to assume it is U+0000 at
    the beginning of the text.
    In that case, you don't have to assume any code unit size for the first
    character of each literal. But you need a function to encode/decode
    character differences into/from the byte stream; as code points range
    from U+0000 to U+10FFFF, the code-point differences fall within
    [-10FFFF..+10FFFF], so they will need up to 22 bits in the worst case
    and would benefit from a variable-length encoding of signed integers,
    like in DecodeLength().
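
    Here is a minimal hypothetical sketch of this proposal (not taken from
    UTN #31); EncodeSignedDelta() stands for the variable-length signed
    encoder discussed below, and all names are invented for illustration:

        #include <stdint.h>

        void EncodeSignedDelta(int32_t delta, uint8_t **output); /* sketched below */

        /* Encode the first character of a literal as a signed difference
           from the previous code point of the original text (assumed to
           be U+0000 at the very start of the text). */
        void EncodeLiteralStart(uint32_t codepoint, uint32_t *previous,
                                uint8_t **output)
        {
            /* The difference is within [-0x10FFFF..+0x10FFFF], i.e. at
               most 22 bits, so a variable-length encoding pays off. */
            int32_t delta = (int32_t)codepoint - (int32_t)*previous;
            EncodeSignedDelta(delta, output);
            *previous = codepoint;
        }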

    Storing signed integers with a variable-length encoding should be
    described more formally. If we need 1 bit in each byte for the
    continuation flag, this leaves 7 bits per byte, enough for signed
    values in the range -64..+63 in a single byte; but if the integer to
    encode falls outside that range, the sign extension must be explained,
    because the representation is in little-endian order: the first 7 bits
    of the encoded integer are the least significant ones, so bit 6 of the
    first byte is no longer the sign bit.

    The sign bit will instead be in bit 6 of the last byte of the sequence.
    This would be simpler to manage with a big-endian representation, where
    the sign bit sits in the first byte; for consistency, it would then
    also be more appropriate to use a big-endian representation for
    encoding the (always positive) length and offset values.
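
    As an illustration only (this is not part of UTN #31), here is a
    minimal sketch of such a big-endian variable-length signed encoding,
    using the hypothetical names introduced above:

        #include <stdint.h>

        /* Bit 7 of every byte is the continuation flag; bits 6..0 carry
           payload. The most significant 7-bit group comes first, so the
           sign bit is bit 6 of the FIRST byte of the sequence. */
        void EncodeSignedDelta(int32_t value, uint8_t **output)
        {
            uint8_t groups[5];
            int n = 0;
            /* Collect 7-bit groups, least significant first, until the
               remaining high bits are pure sign extension of bit 6. */
            for (;;) {
                uint8_t g = value & 0x7F;
                value >>= 7;           /* assumes arithmetic right shift */
                groups[n++] = g;
                if ((value == 0 && !(g & 0x40)) ||
                    (value == -1 && (g & 0x40)))
                    break;
            }
            /* Emit the most significant group first, with the
               continuation bit set on every byte except the last. */
            uint8_t *p = *output;
            while (n-- > 0)
                *p++ = groups[n] | (n > 0 ? 0x80 : 0x00);
            *output = p;
        }

        int32_t DecodeSignedDelta(const uint8_t **input)
        {
            const uint8_t *p = *input;
            /* Sign-extend from bit 6 of the first byte. */
            int32_t value = (*p & 0x40) ? -1 : 0;
            int more;
            do {
                value = (int32_t)(((uint32_t)value << 7) | (*p & 0x7F));
                more = (*p++ & 0x80) != 0;
            } while (more);
            *input = p;
            return value;
        }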

    Finally, the document does not show explicitly how length and offset
    values are converted into byte streams (not even in the sample code in
    figures 6/7 for compression, or in figure 8 for decompression). This
    means that the algorithm, as explained, is still not a complete
    definition of an interoperable algorithm:

    Have you considered that a length equal to 0 has no useful function,
    and that an offset equal to 0 would clearly be invalid, because it
    would refer at or past the position of the current code point being
    encoded? If so, there are invalid (length, offset) pairs for matches,
    and the effective encoding in the byte stream could be improved, for
    example by extending the range of representable offsets without
    changing the length encoding.

    My opinion is that you should study how to encode only the valid
    (length, offset) pairs effectively, as we should always have:
    * length > 1 (otherwise this is not a useful match and we are in a
      literal)
    * offset > 0 (otherwise the match would reference the current, not yet
      decoded, position)
    * length <= offset (otherwise the match would read past the data
      decoded so far)
    i.e. length must be in [2..offset] in valid matches.

    For example, the (length, offset) = (0, 0) pair is invalid; the
    smallest valid pair is (2, 2), i.e. a match repeating the last two
    characters of the input.

    If you put these pairs in a grid diagram, you get a less than
    half-filled grid, and there are certainly ways to represent each pair
    with a single variable-length code, using a heuristic model for the
    most frequent pairs, instead of encoding the two values independently.
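
    For instance, here is a hypothetical sketch (not from UTN #31) of how
    the valid pairs, which form a triangle, could be mapped to and from a
    single index that a variable-length or statistical coder could then
    represent directly:

        #include <stdint.h>

        /* Map a valid pair (2 <= length <= offset) to a single index.
           Pairs with smaller offsets occupy the first
           (offset-2)*(offset-1)/2 indices; the length selects a position
           within the row for the current offset. */
        uint32_t PairToIndex(uint32_t length, uint32_t offset)
        {
            return (offset - 2) * (offset - 1) / 2 + (length - 2);
        }

        /* Inverse mapping: walk forward one row (one offset value) at a
           time; the row for a given offset holds (offset - 1) valid
           lengths. */
        void IndexToPair(uint32_t index, uint32_t *length, uint32_t *offset)
        {
            uint32_t o = 2;
            while (index >= o - 1) {
                index -= o - 1;
                o++;
            }
            *offset = o;
            *length = index + 2;
        }

    With this mapping, the smallest valid pair (2, 2) gets index 0, and no
    code space is wasted on invalid pairs such as (0, 0).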


