Re: Least used parts of BMP.

From: Kannan Goundan (kannan@cakoose.com)
Date: Thu Jun 03 2010 - 00:07:26 CDT

    On Wed, Jun 2, 2010 at 21:43, Doug Ewell <doug@ewellic.org> wrote:
    >> If you want a really fast alternate encoding, you could encode all of
    >> Unicode in at most 3 bytes.  Use the high bit as a "continuation" bit and
    >> the lower 7 bits as the data.
    >>
    >> ASCII gets passed through unchanged.
    >
    > This is essentially what I was going to suggest to Kannan, since avoidance
    > of ASCII bytes, nulls, etc. is not relevant to his use case. The conversion
    > is lightning-fast; it can be optimized to be even faster than UTF-8.
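
A minimal sketch of the continuation-bit scheme quoted above, for illustration only. The quoted text does not say in which order the 7-bit groups are written, so this assumes most-significant group first; the function names are hypothetical and not taken from the thread.

```python
MAX_CODE_POINT = 0x10FFFF  # 21 bits, so three 7-bit groups always suffice


def encode_continuation(cp: int) -> bytes:
    """Encode one code point: 7 data bits per byte, high bit set means 'more follows'."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    if cp < 0x80:
        # ASCII passes through unchanged.
        return bytes([cp])
    if cp < 0x4000:
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def decode_continuation(data: bytes) -> list[int]:
    """Decode a byte string produced by encode_continuation."""
    out = []
    value = 0
    pending = 0  # bytes consumed for the code point in progress
    for b in data:
        value = (value << 7) | (b & 0x7F)
        pending += 1
        if b < 0x80:
            # High bit clear: this byte ends the sequence.
            if value > MAX_CODE_POINT:
                raise ValueError("value out of Unicode range")
            out.append(value)
            value, pending = 0, 0
        elif pending == 3:
            raise ValueError("sequence longer than 3 bytes")
    if pending:
        raise ValueError("truncated sequence")
    return out
```

Under these assumptions U+03BB encodes to the two bytes 0x87 0x3B, and U+10FFFF to 0xC3 0xFF 0x7F.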

    This is currently what I do (I was referring to this as the "compact
    UTF-8-like encoding"). The one difference is that I put all the
    marker bits in the first byte (instead of in the high bit of every
    byte):

      0xxxxxxx
      10xxxxxx xyyyyyyy
      110xxxxx xxyyyyyy yzzzzzzz

    This is essentially how I encode integers as well.
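
A minimal sketch of the layout in the diagram above, with all marker bits in the first byte. It assumes the payload bits are packed most-significant first, as the left-to-right x/y/z labels suggest but the message does not state outright; the function names are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF


def encode_prefix(cp: int) -> bytes:
    """Encode one code point with the length marker in the first byte only."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    if cp < 0x80:
        # 0xxxxxxx: 7 payload bits
        return bytes([cp])
    if cp < 0x4000:
        # 10xxxxxx xyyyyyyy: 6 + 8 = 14 payload bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    # 110xxxxx xxyyyyyy yzzzzzzz: 5 + 16 = 21 payload bits
    return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])


def decode_prefix(data: bytes) -> list[int]:
    """Decode a byte string produced by encode_prefix."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        # The first byte alone determines the sequence length.
        if b0 < 0x80:
            length, value = 1, b0
        elif b0 < 0xC0:
            length, value = 2, b0 & 0x3F
        elif b0 < 0xE0:
            length, value = 3, b0 & 0x1F
        else:
            raise ValueError("invalid lead byte")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        # Trailing bytes carry 8 payload bits each, with no marker bits.
        for b in data[i + 1:i + length]:
            value = (value << 8) | b
        if value > MAX_CODE_POINT:
            raise ValueError("value out of Unicode range")
        out.append(value)
        i += length
    return out
```

Because the length is known from the first byte, a decoder like this branches once per code point instead of testing the high bit of every byte. The trade-off is that trailing bytes can take any value, so a reader cannot resynchronize from the middle of a stream.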

    -- Kannan


