Re: Amount of Space

From: Doug Ewell (
Date: Tue Jul 17 2007 - 00:31:22 CDT

    William J Poser <wjposer at ldc dot upenn dot edu> wrote:

    > If you only need certain ranges, you may be able to find an ad hoc
    > compression scheme that saves a lot of space. For example, if you need
    > a range that encodes as three or four bytes in UTF-8 and otherwise
    > only ASCII, you might save a lot of space simply by subtracting the
    > base codepoint of the range from each codepoint and adding it back
    > on decompression. Depending on the case, the fact that a particular
    > code represents ASCII or the upper range could be indicated either
    > by markup or by downshifting by (base codepoint - 128), so that any
    > code above 127 would be in the non-ASCII range.

    This is sort of a composite of the SCSU and BOCU-1 approaches.
    Differential compression works well when encoding the differences is
    cheaper than any method of encoding the code points themselves, either
    directly or by using them as indices into windows (as SCSU does) or
    some other way. BOCU-1 defines a system of lead bytes and trail bytes
    that keeps the byte count down for typical cases.
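
As a rough illustration of the downshifting idea in William's message, here is a minimal Python sketch. It is not SCSU or BOCU-1, and it rests on assumptions that are not in the thread: the text contains only ASCII plus one range of at most 128 code points starting at `base`, so every character fits in a single byte. The function names are hypothetical.

```python
def encode_offset(text: str, base: int) -> bytes:
    """Map ASCII to 0..127 and [base, base + 128) to 128..255."""
    shift = base - 128  # "downshifting by the base codepoint - 128"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(cp)          # ASCII passes through unchanged
        elif base <= cp < base + 128:
            out.append(cp - shift)  # lands in 128..255
        else:
            # Assumption of this sketch: nothing outside the two ranges.
            raise ValueError(f"U+{cp:04X} is outside the supported ranges")
    return bytes(out)


def decode_offset(data: bytes, base: int) -> str:
    """Reverse encode_offset by adding the shift back to bytes above 127."""
    shift = base - 128
    return "".join(chr(b if b < 128 else b + shift) for b in data)


if __name__ == "__main__":
    # Devanagari (U+0900..U+097F) takes three bytes per character in UTF-8.
    sample = "\u0928\u092e\u0938\u094d\u0924\u0947, world"
    packed = encode_offset(sample, 0x0900)
    print(len(sample.encode("utf-8")), "bytes as UTF-8")
    print(len(packed), "bytes with the offset scheme")
    assert decode_offset(packed, 0x0900) == sample
```

The saving disappears as soon as the text strays outside the two ranges, which is the problem that SCSU's switchable windows and BOCU-1's lead and trail bytes are designed to handle.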

    I built a Huffman encoder for Unicode text, and stored the tree using
    XOR differences between code points instead of using the code points
    themselves. This technique led to either slightly worse or
    substantially better results, depending on the language and script. So
    William has a point about using differences, but only testing can show
    whether the added compression justifies the effort.
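
The message does not say how the XOR differences were laid out in the stored tree, so the following Python sketch only illustrates the general idea, under the assumption that the tree's code points are written out as a flat list. Each entry is XORed with the previous one, and the function names are hypothetical.

```python
def xor_deltas(code_points):
    """Replace each code point with its XOR against the previous one."""
    prev = 0
    deltas = []
    for cp in code_points:
        deltas.append(cp ^ prev)
        prev = cp
    return deltas


def from_xor_deltas(deltas):
    """Recover the original code points with a running XOR."""
    prev = 0
    code_points = []
    for d in deltas:
        prev ^= d
        code_points.append(prev)
    return code_points


if __name__ == "__main__":
    # Three Devanagari letters from the same block share their high bits,
    # so every delta after the first is a small number.
    cps = [0x0928, 0x092E, 0x0938]
    deltas = xor_deltas(cps)
    print([hex(d) for d in deltas])
    assert from_xor_deltas(deltas) == cps
```

When neighboring entries come from the same block, the deltas are small and can be stored in fewer bits. When they jump between distant blocks, a delta can be as large as the code points themselves, which is consistent with the mixed results reported above.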

    I suspect Daniel has something much simpler in mind, and in the end the
    best answer for him is probably to just use UTF-8.

    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
