Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Wed Jan 19 2005 - 12:38:22 CST

    On 2005/01/19 06:44, Doug Ewell wrote:

    > Hans Aberg <haberg at math dot su dot se> wrote:
    >> The UTF-BSS ("UTF-8") is not sensitive to the big-/little-endian issue.
    >> And perhaps people might invent other, creative uses.
    > Here's a creative use that shows how UTF-8 does NOT need to be
    > overloaded in this way.
    > I'm developing a "database" (not in the formal sense) of Unicode
    > character names, and one of my design goals is to keep the size of the
    > file down. I'm storing each word separately as a token, and using
    > zero-terminated strings to store sequences of tokens.
    > Obviously some words, such as LETTER, occur more often in character
    > names than others, such as ZZURX, and so I wanted to be able to store
    > commonly occurring tokens in fewer bytes than less common tokens. That
    > initially pointed me toward using UTF-8 for the strings of tokens, even
    > though the UTF-8 sequences wouldn't really be representing "characters"
    > as such.
    > But eventually, I realized that my requirements for this format aren't
    > the same that drove the creation of UTF-8:

    What you describe here is a special case of data compression. You may
    benefit from looking up some of the existing algorithms. Of course, UTF-8
    is only one format, suited to communicating Unicode code points. Other
    applications should use other formats.

    The extension we have discussed here has one interesting property:
    insensitivity to endianness. There are a number of binary formats that are
    otherwise better suited to distributed-code applications, such as CORBA.
    But if one has a file of 32-bit values, and wants to put it up on the
    Internet and be sure that the endianness comes out right, I just noted
    that such a UTF-8 extension could be used for that. Most likely, people
    are developing other such byte formats for special uses. This is probably
    not of much concern to Unicode. But if, for some unforeseen reason, one
    wanted to go beyond the 21-bit limit, it might be good to know what the
    encoding should look like. And in my regular expression generator, I can
    do whatever I want once I go beyond the 21-bit limit -- I need only make
    sure that the user finds it convenient.
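
    To make the endianness point concrete, here is a minimal sketch of what
    such an extension might look like. It is an assumption, not a format that
    Unicode or anyone in this thread has specified: it simply continues the
    UTF-8 lead-byte pattern up to a 7-byte form (lead byte 0xFE followed by
    six continuation bytes), which is enough to carry a full 32-bit value.
    The names encode_ext and decode_ext are hypothetical. Standard UTF-8
    stops at four bytes and U+10FFFF, so anything longer here is
    non-standard.

```python
import struct

# (payload bits, lead-byte marker) for the 2- to 7-byte forms.
# The 5-, 6- and 7-byte rows are assumptions that continue the UTF-8
# pattern; they are not part of standard UTF-8.
_FORMS = [(11, 0xC0), (16, 0xE0), (21, 0xF0), (26, 0xF8), (31, 0xFC), (36, 0xFE)]


def encode_ext(value: int) -> bytes:
    """Encode an unsigned 32-bit value as a UTF-8-style byte sequence."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    if value < 0x80:
        return bytes([value])
    for n_cont, (bits, lead) in enumerate(_FORMS, start=1):
        if value < (1 << bits):
            # Lead byte carries the top bits; each continuation byte
            # carries 6 bits, most significant group first.
            out = [lead | (value >> (6 * n_cont))]
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((value >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable for 32-bit input")


def decode_ext(data: bytes) -> int:
    """Decode one sequence produced by encode_ext.

    This sketch does not reject overlong forms.
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        n_cont, value = 0, first
    else:
        # The number of leading 1 bits in the lead byte gives the length.
        n_ones = 0
        while n_ones < 8 and first & (0x80 >> n_ones):
            n_ones += 1
        if n_ones == 1 or n_ones == 8:
            raise ValueError("invalid lead byte")
        n_cont = n_ones - 1
        value = first & (0x7F >> n_ones)
    if len(data) != n_cont + 1:
        raise ValueError("wrong sequence length")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value


# A raw 32-bit integer has two possible byte orders on the wire...
value = 0x80000000
assert struct.pack("<I", value) != struct.pack(">I", value)
# ...whereas the byte sequence above has only one, so it round-trips
# without either side agreeing on an endianness.
assert decode_ext(encode_ext(value)) == value
```

    For values up to 0x10FFFF (surrogates aside) the output coincides with
    ordinary UTF-8, so the sketch only differs from the standard once one
    goes beyond the 21-bit limit.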

      Hans Aberg
