From: Doug Ewell (email@example.com)
Date: Tue Jan 18 2005 - 23:44:01 CST
Hans Aberg <haberg at math dot su dot se> wrote:
> The UTF-BSS ("UTF-8") is not sensitive to the big/endian issue. And
> perhaps people might invent other, creative uses.
Here's a creative use that shows how UTF-8 does NOT need to be overloaded
in this way.
I'm developing a "database" (not in the formal sense) of Unicode
character names, and one of my design goals is to keep the size of the
file down. I'm storing each word separately as a token, and using
zero-terminated strings to store sequences of tokens.
Obviously some words, such as LETTER, occur more often in character
names than others, such as ZZURX, and so I wanted to be able to store
commonly occurring tokens in fewer bytes than less common tokens. That
initially pointed me toward using UTF-8 for the strings of tokens, even
though the UTF-8 sequences wouldn't really be representing "characters".
But eventually, I realized that my requirements for this format aren't
the same as those that drove the creation of UTF-8:
* I don't need non-overlapping ranges for lead and trail bytes.
* I don't need rapid encoding time.
* I don't need to maintain ASCII safety, except for the zero terminator.
* I DO need the smallest possible average size per token.
So instead of UTF-8, my format uses single bytes from 0x01 through X for
the X most common tokens, and two-byte sequences with a lead byte from
(X + 1) through 0xFF and a trail byte from 0x01 to 0xFF for the rest,
where X is chosen for best fit, as the largest value that satisfies:
number of tokens <= X + (255 * (255 - X))
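To make the arithmetic concrete, here is a minimal Python sketch of the scheme as described above. It is an illustration only, not the actual implementation: the function names are hypothetical, tokens are assumed to be identified by a 0-based rank (most common first), and the figure of 1000 tokens in the usage note is made up.

```python
def choose_x(num_tokens: int) -> int:
    """Return the largest X whose capacity X + 255 * (255 - X) covers num_tokens."""
    # Capacity shrinks as X grows, so search downward from 255.
    for x in range(255, -1, -1):
        if x + 255 * (255 - x) >= num_tokens:
            return x
    raise ValueError("too many tokens for a one- or two-byte scheme")


def encode_token(rank: int, x: int) -> bytes:
    """Encode a 0-based token rank; the X most common ranks take one byte."""
    if rank < x:
        return bytes([rank + 1])              # single byte 0x01..X
    j = rank - x
    lead = x + 1 + j // 255                   # lead byte (X + 1)..0xFF
    trail = 1 + j % 255                       # trail byte 0x01..0xFF
    if lead > 0xFF:
        raise ValueError("rank out of range for this X")
    return bytes([lead, trail])


def encode_tokens(ranks: list[int], x: int) -> bytes:
    """Encode a sequence of ranks as a zero-terminated byte string."""
    return b"".join(encode_token(r, x) for r in ranks) + b"\x00"


def decode_tokens(data: bytes, x: int) -> list[int]:
    """Decode a zero-terminated byte string back into token ranks."""
    ranks = []
    i = 0
    while data[i] != 0:
        b = data[i]
        if b <= x:                            # single-byte token
            ranks.append(b - 1)
            i += 1
        else:                                 # two-byte token
            ranks.append(x + (b - x - 1) * 255 + (data[i + 1] - 1))
            i += 2
    return ranks
```

With a hypothetical 1000 tokens, choose_x(1000) gives X = 252: ranks 0 through 251 take one byte each, and the remaining 748 share lead bytes 0xFD through 0xFF. No encoded byte is ever zero, so the terminator stays unambiguous.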
Notice that this format employs some principles of UTF-8, but it's not
UTF-8, or even an extension thereof. It's optimized to solve the
problem at hand. And that is how it should be.
I suppose we can't stop you from "extending" UTF-8 to solve a problem
for which it is not appropriate, just as we couldn't stop others before
you. All we can say is, it's not recommended.