Re: 32'nd bit & UTF-8

From: Doug Ewell ([email protected])
Date: Tue Jan 18 2005 - 23:44:01 CST

Next message: Doug Ewell: "Re: Coptic II"

Previous message: Philippe Verdy: "Re: 32'nd bit & UTF-8"
In reply to: Hans Aberg: "Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: 32'nd bit & UTF-8"

Hans Aberg <haberg at math dot su dot se> wrote:

> The UTF-BSS ("UTF-8") is not sensitive to the big/endian issue. And
> perhaps people might invent other, creative uses.

Here's a creative use that shows how UTF-8 does NOT need to be overloaded
in this way.

I'm developing a "database" (not in the formal sense) of Unicode
character names, and one of my design goals is to keep the size of the
file down. I'm storing each word separately as a token, and using
zero-terminated strings to store sequences of tokens.

Obviously some words, such as LETTER, occur more often in character
names than others, such as ZZURX, and so I wanted to be able to store
commonly occurring tokens in fewer bytes than less common tokens. That
initially pointed me toward using UTF-8 for the strings of tokens, even
though the UTF-8 sequences wouldn't really be representing "characters"
as such.
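
To make the frequency idea concrete, here is a minimal Python sketch of how
tokens might be ranked so that the most common words get the shortest codes.
The function name rank_tokens, the four sample names, and the choice to split
on whitespace only are assumptions made for illustration; none of them come
from the actual database.

```python
from collections import Counter

def rank_tokens(names):
    """Return the distinct words, ordered from most to least frequent."""
    counts = Counter(word for name in names for word in name.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda word: (-counts[word], word))

# Hypothetical sample; a real run would read the full list of character names.
sample_names = [
    "LATIN CAPITAL LETTER A",
    "LATIN SMALL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "YI SYLLABLE ZZURX",
]
ranking = rank_tokens(sample_names)
```

A word's position in this ranking is what would later decide whether it gets
a short code or a longer one.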

But eventually, I realized that my requirements for this format aren't
the same as those that drove the creation of UTF-8:

* I don't need non-overlapping ranges for lead and trail bytes.
* I don't need rapid encoding time.
* I don't need to maintain ASCII safety, except for the zero terminator.
* I DO need the smallest possible average size per token.

So instead of UTF-8, my format uses single bytes from 0x01 through X for
the X most common tokens, and two-byte sequences with a lead byte from
(X + 1) through 0xFF and a trail byte from 0x01 to 0xFF for the rest,
where X is chosen for best fit, as the largest value that satisfies:

number of tokens <= X + (255 * (255 - X))
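
The scheme can be sketched in Python as follows. This is an illustration
under stated assumptions, not the actual implementation: it reads the fit
condition as "the X single-byte codes plus the 255 * (255 - X) two-byte codes
must cover every token, with X as large as possible", it numbers tokens by a
0-based frequency rank, and the names choose_x, encode_token, encode_name and
decode_name are hypothetical.

```python
def choose_x(num_tokens):
    """Largest X in 0..255 whose code space can hold num_tokens tokens."""
    for x in range(255, -1, -1):
        # X single-byte codes plus (255 - X) lead bytes times 255 trail bytes.
        if x + 255 * (255 - x) >= num_tokens:
            return x
    raise ValueError("too many tokens for a one- or two-byte code")

def encode_token(rank, x):
    """Encode a 0-based frequency rank (0 = most common token)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < x:
        return bytes([rank + 1])            # single byte 0x01..X
    lead_offset, trail_offset = divmod(rank - x, 255)
    lead = x + 1 + lead_offset              # lead byte (X + 1)..0xFF
    if lead > 0xFF:
        raise ValueError("rank does not fit with this X")
    return bytes([lead, trail_offset + 1])  # trail byte 0x01..0xFF

def encode_name(ranks, x):
    """Encode a sequence of token ranks as a zero-terminated byte string."""
    return b"".join(encode_token(rank, x) for rank in ranks) + b"\x00"

def decode_name(data, x):
    """Decode one zero-terminated byte string back into token ranks."""
    ranks = []
    i = 0
    while data[i] != 0:
        b = data[i]
        if b <= x:                          # single-byte token
            ranks.append(b - 1)
            i += 1
        else:                               # lead byte followed by trail byte
            ranks.append(x + (b - x - 1) * 255 + (data[i + 1] - 1))
            i += 2
    return ranks

# 20000 is an arbitrary token count chosen for the example.
x = choose_x(20000)
encoded = encode_name([0, 176, 177, 19999], x)
decoded = decode_name(encoded, x)
```

Because neither single bytes nor trail bytes ever take the value 0x00, the
zero terminator stays unambiguous. Trail bytes do overlap with single-byte
and lead-byte values, so a string can only be decoded from its start.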

Notice that this format employs some principles of UTF-8, but it's not
UTF-8, or even an extension thereof. It's optimized to solve the
problem at hand. And that is how it should be.

I suppose we can't stop you from "extending" UTF-8 to solve a problem
for which it is not appropriate, just as we couldn't stop others before
you. All we can say is, it's not recommended.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

