RE: UTN #31 and direct compression of code points

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 07 2007 - 02:53:17 CDT


    Doug Ewell wrote:
    >
    > I just discovered UTN #31, "Fast Compression Algorithm for Unicode
    > Text," which describes a simplified Lempel-Ziv-type approach that is
    > applied directly to "16 bit Unicode characters," unlike most LZ and
    > other compression algorithms which operate on bytes.

    In fact I am a bit puzzled by the comment on the second line of the sample
    code below:
        length = DecodeLength(&input);
        offset = DecodeOffset(&input); // same algorithm as DecodeLength

    This does not look correct, because only the length (of a match or of a
    literal) is encoded that way: bit 7 of the first byte (in *input)
    indicates whether it is a match or a literal, bit 6 of the first byte
    indicates the presence of further bits for the encoded length, and each
    following byte then contains a 1-bit continuation flag and 7 extra bits
    of the length value.

    For encoding the offset (in matches only), what is the use of bits 7 and 6?
    Couldn't we store up to 7 bits of the offset value (instead of 6 bits) in
    the same byte without requiring an extra byte?

    If so, the two functions DecodeLength() and DecodeOffset() need to be
    different.
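
    To make the difference concrete, here is a minimal hypothetical sketch
    (not taken from UTN #31, and assuming the bit layout just described) of
    what the two distinct decoders could look like; the isMatch
    out-parameter is an addition for illustration:

        #include <stdint.h>

        /* Length: bit 7 of the first byte is the match/literal flag,
           bit 6 is the continuation flag, bits 5..0 are the low 6 bits of
           the value; each following byte has a continuation flag in bit 7
           and 7 more payload bits (little-endian accumulation). */
        static uint32_t DecodeLength(const uint8_t **input, int *isMatch)
        {
            const uint8_t *p = *input;
            uint32_t value = *p & 0x3F;
            int shift = 6;
            *isMatch = (*p & 0x80) != 0;
            int more = (*p++ & 0x40) != 0;
            while (more) {
                value |= (uint32_t)(*p & 0x7F) << shift;
                shift += 7;
                more = (*p++ & 0x80) != 0;
            }
            *input = p;
            return value;
        }

        /* Offset: no match/literal flag is needed, so every byte can use
           bit 7 as the continuation flag and all 7 remaining bits as
           payload. */
        static uint32_t DecodeOffset(const uint8_t **input)
        {
            const uint8_t *p = *input;
            uint32_t value = 0;
            int shift = 0;
            int more;
            do {
                value |= (uint32_t)(*p & 0x7F) << shift;
                shift += 7;
                more = (*p++ & 0x80) != 0;
            } while (more);
            *input = p;
            return value;
        }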

    Also, figure 3 shows 0041 15 0C FD 78 for the output, but it would be
    better to show 41 00 15 0C FD 78, to make explicit that the first code
    unit is stored in little-endian byte order.

    However, I wonder if the choice of the fixed-size little-endian 16-bit
    format for the first character of a literal is appropriate. Why
    couldn't it be represented as a code-point difference, like the rest of
    the literal?
    ==> All you need is to define the value of the code point that precedes
    the stored literal in the original text, or to assume it is U+0000 at
    the beginning of the text.
    In that case, you don't have to assume any code unit size for the first
    character of each literal. But you need a function to encode/decode
    character differences into/from the byte stream; as code points range
    from U+0000 to U+10FFFF, the code-point differences fall within
    [-10FFFF..+10FFFF], so they will need up to 22 bits in the worst case
    and would benefit from a variable-length encoding of signed integers,
    like in DecodeLength().
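
    Here is a minimal hypothetical sketch of this proposal (not taken from
    UTN #31); EncodeSignedDelta() stands for the variable-length signed
    encoder discussed below, and all names are invented for illustration:

        #include <stdint.h>

        void EncodeSignedDelta(int32_t delta, uint8_t **output); /* sketched below */

        /* Encode the first character of a literal as a signed difference
           from the previous code point of the original text (assumed to
           be U+0000 at the very start of the text). */
        void EncodeLiteralStart(uint32_t codepoint, uint32_t *previous,
                                uint8_t **output)
        {
            /* The difference is within [-0x10FFFF..+0x10FFFF], i.e. at
               most 22 bits, so a variable-length encoding pays off. */
            int32_t delta = (int32_t)codepoint - (int32_t)*previous;
            EncodeSignedDelta(delta, output);
            *previous = codepoint;
        }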

    Storing signed integers with a variable-length encoding should be
    described more formally. If we need 1 bit in each byte for the
    continuation flag, this leaves 7 bits per byte, enough for signed
    values in the range -64..+63 in a single byte; but if the integer to
    encode falls outside that range, the sign extension must be explained,
    because the representation is in little-endian order: the first 7 bits
    of the encoded integer are the least significant ones, so bit 6 of the
    first byte is no longer the sign bit.

    The sign bit will instead be in bit 6 of the last byte of the sequence.
    This would be simpler to manage with a big-endian representation, where
    the sign bit sits in the first byte; for consistency, it would then
    also be more appropriate to use a big-endian representation for
    encoding the (always positive) length and offset values.
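
    As an illustration only (this is not part of UTN #31), here is a
    minimal sketch of such a big-endian variable-length signed encoding,
    using the hypothetical names introduced above:

        #include <stdint.h>

        /* Bit 7 of every byte is the continuation flag; bits 6..0 carry
           payload. The most significant 7-bit group comes first, so the
           sign bit is bit 6 of the FIRST byte of the sequence. */
        void EncodeSignedDelta(int32_t value, uint8_t **output)
        {
            uint8_t groups[5];
            int n = 0;
            /* Collect 7-bit groups, least significant first, until the
               remaining high bits are pure sign extension of bit 6. */
            for (;;) {
                uint8_t g = value & 0x7F;
                value >>= 7;           /* assumes arithmetic right shift */
                groups[n++] = g;
                if ((value == 0 && !(g & 0x40)) ||
                    (value == -1 && (g & 0x40)))
                    break;
            }
            /* Emit the most significant group first, with the
               continuation bit set on every byte except the last. */
            uint8_t *p = *output;
            while (n-- > 0)
                *p++ = groups[n] | (n > 0 ? 0x80 : 0x00);
            *output = p;
        }

        int32_t DecodeSignedDelta(const uint8_t **input)
        {
            const uint8_t *p = *input;
            /* Sign-extend from bit 6 of the first byte. */
            int32_t value = (*p & 0x40) ? -1 : 0;
            int more;
            do {
                value = (int32_t)(((uint32_t)value << 7) | (*p & 0x7F));
                more = (*p++ & 0x80) != 0;
            } while (more);
            *input = p;
            return value;
        }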

    Finally, the document does not show explicitly how length and offset
    values are converted into byte streams (not even in the sample code in
    figures 6/7 for compression, or in figure 8 for decompression). This
    means that the algorithm, as explained, is still not a complete
    definition of an interoperable algorithm:

    Have you considered that a length equal to 0 has no useful function,
    and that an offset equal to 0 would clearly be invalid, because it
    would refer at or past the position of the current code point being
    encoded? If so, there are invalid (length, offset) pairs for matches,
    and the effective encoding in the byte stream could be improved, for
    example by extending the range of representable offsets without
    changing the length encoding.

    My opinion is that you should study how to encode only the valid
    (length, offset) pairs effectively, as we should always have:
    * length > 1 (otherwise this is not a useful match and we are in a
      literal)
    * offset > 0 (otherwise the match would reference the current, not yet
      decoded, position)
    * length <= offset (otherwise the match would read past the data
      decoded so far)
    i.e. length must be in [2..offset] in valid matches.

    For example, the (length, offset) = (0, 0) pair is invalid; the
    smallest valid pair is (2, 2), i.e. a match repeating the last two
    characters of the input.

    If you put these pairs in a grid diagram, you get a less than
    half-filled grid, and there are certainly ways to represent each pair
    with a single variable-length code, using a heuristic model for the
    most frequent pairs, instead of encoding the two values independently.
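
    For instance, here is a hypothetical sketch (not from UTN #31) of how
    the valid pairs, which form a triangle, could be mapped to and from a
    single index that a variable-length or statistical coder could then
    represent directly:

        #include <stdint.h>

        /* Map a valid pair (2 <= length <= offset) to a single index.
           Pairs with smaller offsets occupy the first
           (offset-2)*(offset-1)/2 indices; the length selects a position
           within the row for the current offset. */
        uint32_t PairToIndex(uint32_t length, uint32_t offset)
        {
            return (offset - 2) * (offset - 1) / 2 + (length - 2);
        }

        /* Inverse mapping: walk forward one row (one offset value) at a
           time; the row for a given offset holds (offset - 1) valid
           lengths. */
        void IndexToPair(uint32_t index, uint32_t *length, uint32_t *offset)
        {
            uint32_t o = 2;
            while (index >= o - 1) {
                index -= o - 1;
                o++;
            }
            *offset = o;
            *length = index + 2;
        }

    With this mapping, the smallest valid pair (2, 2) gets index 0, and no
    code space is wasted on invalid pairs such as (0, 0).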


