Re: Least used parts of BMP.

From: Mark Davis ☕ (mark@macchiato.com)
Date: Wed Jun 02 2010 - 22:12:16 CDT

    An alternative that I've used is:

       - Serialize every unsigned integer as a sequence of bytes carrying 7 bits
       each, with the top bit off for all but the last one.
       - For signed integers, shift left by 1 bit, then invert if the original
       was negative, then serialize as unsigned.
       - Serialize a string as an integer length followed by a sequence of code
       points expressed as integer deltas.
          - For the deltas, set Previous=0 and loop, where each delta =
          current - (Previous with the last 6 bits set to 0x40).
       - Serialize floats/doubles as an integer exponent, then the sign+mantissa
       (but in reverse byte order, e.g. MSF).
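
    The two integer rules above can be sketched roughly as follows. This is
    only an illustration, not the code referred to in this message: the
    function names are hypothetical, and the order of the 7-bit groups (most
    significant first) is an assumption, since the description does not
    specify it.

```python
def encode_unsigned(n):
    """Emit 7 bits per byte; the top bit is set only on the last byte."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()      # assumption: most significant group first
    groups[-1] |= 0x80    # mark the final byte
    return bytes(groups)


def decode_unsigned(data, pos=0):
    """Return (value, next_position), reading until a byte with the top bit set."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos


def to_unsigned(n):
    """Shift left by one bit, then invert if the original was negative."""
    shifted = n << 1
    return ~shifted if n < 0 else shifted


def from_unsigned(u):
    """Inverse of to_unsigned: the low bit records the original sign."""
    return ~(u >> 1) if u & 1 else u >> 1


def encode_signed(n):
    return encode_unsigned(to_unsigned(n))


def decode_signed(data, pos=0):
    u, pos = decode_unsigned(data, pos)
    return from_unsigned(u), pos
```

    With this mapping, signed values from -64 to 63 fit in a single byte,
    which is what makes small deltas cheap.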

    This tends to produce pretty reasonable compression given that it is very
    simple code and a fast transform.
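
    As a rough illustration of why the string rule compresses well, here is
    a minimal self-contained sketch. It makes several assumptions that the
    description leaves open: the length is counted in code points, the 7-bit
    groups are written most significant first, and "the last 6 bits set to
    0x40" is read as replacing the low bits of Previous with 0x40, i.e. the
    midpoint of its 128-code-point block. All names are hypothetical.

```python
def put_varint(out, n):
    """Append n in 7-bit groups; only the last byte has its top bit set."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()      # assumption: most significant group first
    groups[-1] |= 0x80
    out.extend(groups)


def get_varint(data, pos):
    """Return (value, next_position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos


def block_base(previous):
    """Previous with its low bits replaced by 0x40 (middle of its 128-block)."""
    return (previous & ~0x7F) | 0x40


def encode_string(text):
    out = bytearray()
    put_varint(out, len(text))  # assumption: length in code points
    previous = 0
    for ch in text:
        current = ord(ch)
        delta = current - block_base(previous)
        # Signed-to-unsigned step: shift left, invert if negative.
        put_varint(out, ~(delta << 1) if delta < 0 else delta << 1)
        previous = current
    return bytes(out)


def decode_string(data):
    count, pos = get_varint(data, 0)
    chars = []
    previous = 0
    for _ in range(count):
        u, pos = get_varint(data, pos)
        delta = ~(u >> 1) if u & 1 else u >> 1
        previous = block_base(previous) + delta
        chars.append(chr(previous))
    return "".join(chars)
```

    Under these assumptions, each character that stays in the same
    128-code-point block as its predecessor costs one byte, so a run of
    Greek or Cyrillic letters takes about one byte per letter instead of
    the two that UTF-8 needs.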

    Mark

    — Il meglio è l’inimico del bene —

    On Wed, Jun 2, 2010 at 18:11, Kannan Goundan <kannan@cakoose.com> wrote:

    > Thanks to everyone for the detailed responses. I definitely
    > appreciate the feedback on the broader issue (even though my question
    > was very narrow).
    >
    > I should clarify my use case a little. I'm creating a generic data
    > serialization format similar to Google Protocol Buffers and Apache
    > Thrift. Other than Unicode strings, the format supports many other
    > data types -- all of which are serialized in a custom format. Some
    > data types will contain a lot of string data while others will contain
    > very little. As is the case with other tools in this area, standard
    > compression techniques can be applied to the entire payload as a
    > separate pass (e.g. gzip).
    >
    > I can see how there are benefits to using one of the standard
    > encodings. However, at this point, my goals are basically fast
    > serialization/deserialization and small size. I might eventually see
    > the error in my ways (and feel like an idiot for ignoring your
    > advice), but in the interest of not wasting your time any more than I
    > already have, I should mention that suggestions to stick to a standard
    > encoding will fall on mostly deaf ears.
    >
    > For my current use case, I don't need to perform random accesses in
    > serialized data so I don't see a need to make the space-usage
    > compromises that UTF-8 and UTF-16 make. A more compact UTF-8-like
    > encoding will get you ASCII with one byte, the first 1/4 of the BMP
    > with two bytes, and everything else with three bytes. A more compact
    > UTF-16-like format gets the BMP in 2 bytes (minus some PUA) and
    > everything else in 3. Maybe not huge savings, but if you're of the
    > opinion that sticking to a standard doesn't buy you anything... :-)
    >
    > I'll definitely take a closer look at SCSU. Hopefully the encoding
    > speed is good enough. Most of the other serialization tools just
    > blast out UTF-8, making them very fast on strings that contain mostly
    > ASCII. I hope SCSU doesn't get me killed in ASCII-only encoding
    > benchmarks (http://wiki.github.com/eishay/jvm-serializers/). I really
    > do like the idea of making my format less ASCII-biased, though. And,
    > like I said before, I don't care much about sticking to a standard
    > encoding -- if stock SCSU ends up being too slow or complex, I might
    > still be able to use techniques from SCSU in a custom encoding.
    >
    > (Philippe: when I said I needed 20 bits, I meant that I needed 20 bits
    > for the stuff after the BMP. I fully intend for my encoding to handle
    > every Unicode codepoint, minus surrogates.)
    >
    > Thanks again, everyone.
    > -- Kannan
    >
    > On Wed, Jun 2, 2010 at 13:12, Asmus Freytag <asmusf@ix.netcom.com> wrote:
    > > On 6/2/2010 12:25 AM, Kannan Goundan wrote:
    > >>
    > >> On Tue, Jun 1, 2010 at 23:30, Asmus Freytag <asmusf@ix.netcom.com> wrote:
    > >>
    > >>>
    > >>> Why not use SCSU?
    > >>>
    > >>> You get the small size and the encoder/decoder aren't that
    > >>> complicated.
    > >>>
    > >>
    > >> Hmm... I had skimmed the SCSU document a few days ago. At the time it
    > >> seemed a bit more complicated than I wanted. What's nice about UTF-8
    > >> and UTF-16-like encodings is that the space usage is predictable.
    > >>
    > >> But maybe I'll take a closer look. If a simple SCSU encoder can do
    > >> better than more "standard" encodings 99% of the time, then maybe it's
    > >> worth it...
    > >>
    > >>
    > >
    > > It will, because it's designed to compress commonly used characters.
    > >
    > > Start with the existing sample code and optimize it. Many features of SCSU
    > > are optional; using them gives slightly better compression, but you don't
    > > always have to use them and the result is still legal SCSU. Sometimes
    > > leaving out a feature can make your encoder a tad simpler, although I found
    > > that you can be pretty fast with decent performance.
    > >
    > > A./
    > >