Re: 32'nd bit & UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Jan 18 2005 - 23:44:01 CST

    Hans Aberg <haberg at math dot su dot se> wrote:

    > The UTF-BSS ("UTF-8") is not sensitive to the big/endian issue. And
    > perhaps people might invent other, creative uses.

    Here's a creative use that shows how UTF-8 does NOT need to be overloaded
    in this way.

    I'm developing a "database" (not in the formal sense) of Unicode
    character names, and one of my design goals is to keep the size of the
    file down. I'm storing each word separately as a token, and using
    zero-terminated strings to store sequences of tokens.

    Obviously some words, such as LETTER, occur more often in character
    names than others, such as ZZURX, and so I wanted to be able to store
    commonly occurring tokens in fewer bytes than less common tokens. That
    initially pointed me toward using UTF-8 for the strings of tokens, even
    though the UTF-8 sequences wouldn't really be representing "characters"
    as such.

    But eventually, I realized that my requirements for this format aren't
    the same as those that drove the creation of UTF-8:

    * I don't need non-overlapping ranges for lead and trail bytes.
    * I don't need rapid encoding time.
    * I don't need to maintain ASCII safety, except for the zero terminator.
    * I DO need the smallest possible average size per token.

    So instead of UTF-8, my format uses single bytes from 0x01 through X for
    the X most common tokens, and two-byte sequences with a lead byte from
    (X + 1) through 0xFF and a trail byte from 0x01 to 0xFF for the rest,
    where X is chosen for best fit as follows:

        number of tokens <= X + (255 * (255 - X))
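
    As a rough illustration, the choice of X can be sketched in Python.
    This reads the formula as a capacity bound: the X one-byte codes plus
    255 two-byte codes for each of the (255 - X) lead bytes must cover
    every token, and "best fit" is taken to mean the largest X that still
    does. The names capacity and choose_x are hypothetical, not taken from
    the actual implementation.

```python
def capacity(x: int) -> int:
    """Number of distinct tokens representable when x one-byte codes are used."""
    lead_bytes = 255 - x              # lead bytes run from x + 1 through 0xFF
    return x + 255 * lead_bytes       # each lead byte pairs with trail bytes 0x01..0xFF


def choose_x(num_tokens: int) -> int:
    """Largest x in 0..255 whose capacity still covers num_tokens (assumed 'best fit')."""
    for x in range(255, -1, -1):
        if capacity(x) >= num_tokens:
            return x
    raise ValueError("too many tokens for a one- or two-byte scheme")


if __name__ == "__main__":
    print(choose_x(10_000))
```

    Under these assumptions, 10,000 tokens would give X = 216: that leaves
    39 lead bytes and room for 10,161 tokens, whereas X = 217 would only
    hold 9,907.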

    Notice that this format employs some principles of UTF-8, but it's not
    UTF-8, or even an extension thereof. It's optimized to solve the
    problem at hand. And that is how it should be.
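
    The one- and two-byte layout described above could look something like
    the following Python sketch. It assumes tokens are numbered from 0 in
    descending order of frequency and that two-byte codes are assigned in
    simple lead-major order; both are guesses made for illustration, and
    the function names encode and decode are hypothetical.

```python
def encode(indices, x):
    """Encode 0-based token ranks as a zero-terminated byte string."""
    out = bytearray()
    for i in indices:
        if i < 0:
            raise ValueError("token index must be non-negative")
        if i < x:
            out.append(i + 1)                 # single byte 0x01..x
        else:
            lead, trail = divmod(i - x, 255)
            if x + 1 + lead > 0xFF:
                raise ValueError("token index out of range for this x")
            out.append(x + 1 + lead)          # lead byte x+1..0xFF
            out.append(trail + 1)             # trail byte 0x01..0xFF, never zero
    out.append(0)                             # zero terminator
    return bytes(out)


def decode(data, x):
    """Decode a zero-terminated byte string back into token ranks."""
    indices = []
    pos = 0
    while data[pos] != 0:
        b = data[pos]
        if b <= x:                            # one-byte code
            indices.append(b - 1)
            pos += 1
        else:                                 # two-byte code: read the trail byte too
            trail = data[pos + 1]
            indices.append(x + (b - x - 1) * 255 + (trail - 1))
            pos += 2
    return indices


if __name__ == "__main__":
    sample = [0, 215, 216, 10160]
    packed = encode(sample, 216)
    print(packed.hex(), decode(packed, 216) == sample)
```

    Trail bytes here may overlap the single-byte range, which is harmless
    because the strings are only ever read front to back; this is the
    non-overlapping-ranges property of UTF-8 that the format gives up.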

    I suppose we can't stop you from "extending" UTF-8 to solve a problem
    for which it is not appropriate, just as we couldn't stop others before
    you. All we can say is, it's not recommended.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


