Least used parts of BMP.

From: Kannan Goundan (kannan@cakoose.com)
Date: Tue Jun 01 2010 - 22:04:24 CDT

    I'm trying to come up with a compact encoding for Unicode strings for
    data serialization purposes.  The goals are fast read/write and small
    size.

    The plan:
    1. BMP code points (0x0000-0xFFFF, minus surrogates) are encoded as two bytes.
    2. Non-BMP code points are encoded as three bytes:
       - The first two bytes are a value from the BMP's UTF-16 surrogate
         range (11 bits of data).
       - The next byte provides an additional 8 bits of data.
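
    A minimal sketch of the layout in steps 1 and 2, written in Python for
    illustration.  The function name, the big-endian byte order, and the
    choice to put the high 11 bits into the surrogate-range value are
    assumptions; the plan above does not specify them.

```python
SURROGATE_START = 0xD800  # U+D800..U+DFFF holds 2048 values, i.e. 11 bits


def encode_cp_as_planned(cp: int) -> bytes:
    """Encode one code point using the two-step plan (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        # Step 1: a BMP code point is stored directly as two bytes.
        return cp.to_bytes(2, "big")
    # Step 2: a non-BMP code point becomes an offset in 0 .. 2**20 - 1.
    offset = cp - 0x10000
    if offset >= 1 << 19:
        # 11 bits in the surrogate-range value plus 8 bits in the third byte.
        raise ValueError("offset does not fit in 11 + 8 = 19 bits")
    unit = SURROGATE_START | (offset >> 8)  # high 11 bits
    return unit.to_bytes(2, "big") + bytes([offset & 0xFF])  # low 8 bits
```

    The range check on the offset marks the point where the 11 + 8 bits of
    the three-byte form run out.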

    Unfortunately, this doesn't quite work because it only gives me 19
    bits to encode non-BMP code points, but I need 20 bits.  To solve this
    problem, I'm planning on stealing a bit of code space from the BMP's
    private-use area.  If I did, then:
    - I'd get the bits needed to encode the non-BMP code points in 3 bytes.
    - The stolen code points of the private-use area would now have to be
      encoded using 3 bytes.
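
    One way the stolen range could work, as a sketch under stated
    assumptions rather than a finished design: take the lowest code points
    of the BMP private-use area (U+E000..U+F8FF), so that together with the
    surrogates they form one contiguous range of three-byte prefixes.  Each
    prefix carries 256 values, and those values have to cover the 2**20
    non-BMP code points plus the stolen code points themselves, which works
    out to 2057 stolen code points here.  The names, the byte order, and the
    choice of which private-use code points to steal are all assumptions.

```python
SURROGATE_START = 0xD800              # U+D800..U+DFFF: 2048 prefix values
PUA_START = 0xE000                    # BMP private-use area starts here
NON_BMP_COUNT = 0x110000 - 0x10000    # 2**20 code points above the BMP

# Assumption: steal the n lowest private-use code points. n must satisfy
#   (2048 + n) * 256 >= 2**20 + n
# because the stolen code points also need a three-byte form.
STOLEN = -(-(NON_BMP_COUNT - 2048 * 256) // 255)   # integer ceiling division
ESCAPE_END = PUA_START + STOLEN       # exclusive end of the prefix range


def encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("lone surrogate cannot be encoded")
        if cp >= 0x10000:
            index = cp - 0x10000                        # non-BMP code point
        elif PUA_START <= cp < ESCAPE_END:
            index = NON_BMP_COUNT + (cp - PUA_START)    # stolen PUA code point
        else:
            out += cp.to_bytes(2, "big")                # ordinary BMP: two bytes
            continue
        prefix = SURROGATE_START + (index >> 8)
        out += prefix.to_bytes(2, "big")
        out.append(index & 0xFF)
    return bytes(out)


def decode(data: bytes) -> str:
    chars = []
    i = 0
    while i < len(data):
        if i + 2 > len(data):
            raise ValueError("truncated input")
        unit = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if SURROGATE_START <= unit < ESCAPE_END:
            if i >= len(data):
                raise ValueError("truncated input")
            index = ((unit - SURROGATE_START) << 8) | data[i]
            i += 1
            if index < NON_BMP_COUNT:
                cp = 0x10000 + index
            else:
                cp = PUA_START + (index - NON_BMP_COUNT)
                if cp >= ESCAPE_END:
                    raise ValueError("unused three-byte sequence")
            chars.append(chr(cp))
        else:
            chars.append(chr(unit))
    return "".join(chars)
```

    With this layout every private-use code point outside the stolen range
    still takes two bytes, and decoding only has to test whether the first
    two bytes fall inside one contiguous range.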

    I chose the private-use area because I assumed it would be the least
    commonly used, so having these code points require 3 bytes instead of
    2 bytes wouldn't be that big a deal.  Does this sound reasonable?  Would
    people suggest a different section of the BMP to steal from, or a
    different encoding altogether?

    Thanks for reading.
    -- Kannan

    P.S. I actually have two encodings.  One is similar to UTF-8 in that
    it's ASCII-biased.  The encoding described above is intended for
    non-ASCII-biased data.  The programmer selects which encoding to use
    based on what he thinks the data looks like.



    This archive was generated by hypermail 2.1.5 : Wed Jun 02 2010 - 00:20:19 CDT