Re: Least used parts of BMP.

From: Kannan Goundan
Date: Wed Jun 02 2010 - 20:11:18 CDT

    Thanks to everyone for the detailed responses. I definitely
    appreciate the feedback on the broader issue (even though my question
    was very narrow).

    I should clarify my use case a little. I'm creating a generic data
    serialization format similar to Google Protocol Buffers and Apache
    Thrift. Other than Unicode strings, the format supports many other
    data types -- all of which are serialized in a custom format. Some
    data types will contain a lot of string data while others will contain
    very little. As is the case with other tools in this area, standard
    compression techniques can be applied to the entire payload as a
    separate pass (e.g. gzip).

    I can see how there are benefits to using one of the standard
    encodings. However, at this point, my goals are basically fast
    serialization/deserialization and small size. I might eventually see
    the error in my ways (and feel like an idiot for ignoring your
    advice), but in the interest of not wasting your time any more than I
    already have, I should mention that suggestions to stick to a standard
    encoding will fall on mostly deaf ears.

    For my current use case, I don't need to perform random accesses in
    serialized data so I don't see a need to make the space-usage
    compromises that UTF-8 and UTF-16 make. A more compact UTF-8-like
    encoding will get you ASCII with one byte, the first 1/4 of the BMP
    with two bytes, and everything else with three bytes. A more compact
    UTF-16-like format gets the BMP in 2 bytes (minus some PUA) and
    everything else in 3. Maybe not huge savings, but if you're of the
    opinion that sticking to a standard doesn't buy you anything... :-)
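
    To make the "UTF-8-like" layout above concrete, here is a minimal
    sketch in Python. The bit layout and the names encode_cp, encode and
    decode are assumptions made for illustration only, not the format
    actually being designed: the top bits of the first byte select the
    length, and since nothing needs to be self-synchronizing, the
    trailing bytes carry a full 8 bits each.

```python
def encode_cp(cp: int) -> bytes:
    """Encode one code point using a hypothetical 1/2/3-byte layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: ASCII in one byte
        return bytes([cp])
    if cp < 0x4000:
        # 10xxxxxx xxxxxxxx: 14 bits, i.e. the first quarter of the BMP
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    # 11xxxxxx xxxxxxxx xxxxxxxx: 22 bits, enough for U+10FFFF
    return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


def encode(text: str) -> bytes:
    """Encode a whole string, one code point at a time."""
    return b"".join(encode_cp(ord(ch)) for ch in text)


def decode(data: bytes) -> str:
    """Decode bytes produced by encode().

    No validation of malformed input: truncated data raises IndexError.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, size = lead, 1
        elif lead < 0xC0:
            cp, size = ((lead & 0x3F) << 8) | data[i + 1], 2
        else:
            cp = ((lead & 0x3F) << 16) | (data[i + 1] << 8) | data[i + 2]
            size = 3
        out.append(chr(cp))
        i += size
    return "".join(out)
```

    With this layout, U+00E9 takes two bytes (as in UTF-8), while a
    supplementary character such as U+1F600 takes three bytes instead
    of UTF-8's four. The cost is that a trailing byte can look like a
    lead byte, so a decoder has to start from the beginning of a string.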

    I'll definitely take a closer look at SCSU. Hopefully the encoding
    speed is good enough. Most of the other serialization tools just
    blast out UTF-8, making them very fast on strings that contain mostly
    ASCII. I hope SCSU doesn't get me killed in ASCII-only encoding
    benchmarks. I really do like the idea of making my format less
    ASCII-biased, though. And,
    like I said before, I don't care much about sticking to a standard
    encoding -- if stock SCSU ends up being too slow or complex, I might
    still be able to use techniques from SCSU in a custom encoding.

    (Philippe: when I said I needed 20 bits, I meant that I needed 20 bits
    for the stuff after the BMP. I fully intend for my encoding to handle
    every Unicode codepoint, minus surrogates.)
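
    The 20-bit figure can be checked with a little arithmetic. The
    sketch below is only an illustration; the constant names and the
    helper supplementary_index are hypothetical.

```python
BMP_LAST = 0xFFFF            # last code point of the BMP
MAX_CODE_POINT = 0x10FFFF    # last code point in Unicode

# Code points after the BMP: U+10000 through U+10FFFF (16 planes).
SUPPLEMENTARY_COUNT = MAX_CODE_POINT - BMP_LAST

# Surrogates U+D800 through U+DFFF are code points but not characters.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1

# Everything an encoding has to represent: all code points minus surrogates.
SCALAR_VALUES = (MAX_CODE_POINT + 1) - SURROGATE_COUNT


def supplementary_index(cp: int) -> int:
    """Map a code point after the BMP to a 20-bit index (0 .. 0xFFFFF)."""
    if not BMP_LAST < cp <= MAX_CODE_POINT:
        raise ValueError(f"not a supplementary code point: {cp:#x}")
    return cp - 0x10000
```

    SUPPLEMENTARY_COUNT comes out to exactly 2**20, so the index
    returned by supplementary_index always fits in 20 bits.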

    Thanks again, everyone.
    -- Kannan

    On Wed, Jun 2, 2010 at 13:12, Asmus Freytag wrote:
    > On 6/2/2010 12:25 AM, Kannan Goundan wrote:
    >> On Tue, Jun 1, 2010 at 23:30, Asmus Freytag wrote:
    >>> Why not use SCSU?
    >>> You get the small size and the encoder/decoder aren't that
    >>> complicated.
    >> Hmm... I had skimmed the SCSU document a few days ago. At the time it
    >> seemed a bit more complicated than I wanted. What's nice about UTF-8
    >> and UTF-16-like encodings is that the space usage is predictable.
    >> But maybe I'll take a closer look. If a simple SCSU encoder can do
    >> better than more "standard" encodings 99% of the time, then maybe it's
    >> worth it...
    > It will, because it's designed to compress commonly used characters.
    > Start with the existing sample code and optimize it. Many features of SCSU
    > are optional, using them gives slightly better compression, but you don't
    > always have to use them and the result is still legal SCSU. Sometimes
    > leaving out a feature can make your encoder a tad simpler, although I found
    > that you can be pretty fast with decent performance.
    > A./

    This archive was generated by hypermail 2.1.5 : Wed Jun 02 2010 - 20:13:46 CDT