Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Wed Jan 19 2005 - 12:38:22 CST

  • Next message: John H. Jenkins: "Re: Subject: Re: 32'nd bit & UTF-8"

    On 2005/01/19 08:37, Doug Ewell at wrote:

    > Hans Aberg <haberg at math dot su dot se> wrote:
    >>>>> The old RFC you're referring to is not designating UTF-8, but
    >>>>> UTF-BSS, which is a transformation format,
    >>>> OK. Fine, so we have a name for it.
    >>> I was not sure about the name of it when writing the message.
    >> According to <>, UTF is
    >> short for UCS Transformation Format, where UCS stands for Universal
    >> Character Set. When speaking about the extensions that I speak about,
    >> I think they should certainly have a separate name. Perhaps UTF-8X for
    >> extended, or BTF-8 for "bit (byte) transformation format".
    > RFC 2044, the original (1996) Internet definition of UTF-8, defined up
    > to 6-byte sequences.
    > While RFC 2044 has been superseded (RFC 2279, 1998) and re-superseded
    > (RFC 3629, 2003), and the 5- and 6-byte sequences have been removed, the
    > point is that they were originally defined in an encoding scheme called
    > "UTF-8." It is not true that they were only defined under some other
    > name, such as FSS-UTF (the name used in Unicode 1.1) or "UTF-BSS,"
    > whatever that is.
    > "BTF-8" is taken; see:

    Another name might be CPBTF-8 for "code point to binary transformation
    format".
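    For concreteness, the scheme Doug describes from RFC 2044 extends the
    familiar UTF-8 byte patterns up to 6-byte sequences, covering the full
    31-bit range. A minimal Python sketch of that original table (the
    function name is illustrative, not from any of the RFCs):

    ```python
    def utf8_orig_encode(cp: int) -> bytes:
        """Encode an integer per the original RFC 2044 UTF-8 table,
        which allowed up to 6-byte sequences (31 bits)."""
        if cp < 0 or cp > 0x7FFFFFFF:
            raise ValueError("value outside the 31-bit range")
        if cp < 0x80:
            return bytes([cp])                      # 1-byte form: 0xxxxxxx
        # (lead-byte marker, upper limit) for the 2..6 byte forms
        forms = [(0xC0, 0x800), (0xE0, 0x10000), (0xF0, 0x200000),
                 (0xF8, 0x4000000), (0xFC, 0x80000000)]
        for nbytes, (marker, limit) in enumerate(forms, start=2):
            if cp < limit:
                tail = []
                for _ in range(nbytes - 1):
                    tail.append(0x80 | (cp & 0x3F))  # continuation: 10xxxxxx
                    cp >>= 6
                return bytes([marker | cp] + list(reversed(tail)))
    ```

    For values up to U+10FFFF this agrees with UTF-8 as defined today; the
    5- and 6-byte forms are what RFC 3629 later removed.
    
    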

    >> The Unicode standard is like Big Brother in George Orwell's "1984",
    >> making it possible to only speak about what is right, but not what is
    >> wrong.
    > My goodness.

    Shocking, isn't it? :-)

    >> Besides, even though Unicode has declared to never use more than 21
    >> bits, in the track record, Unicode has reneged on such promises. It
    >> might be prudent to knock down a full 32-bit encoding, declaring
    >> UTF-8/32 to be subsets of that.
    > I suppose the "promise" that you are referring to, on which Unicode
    > "reneged," was the original 16-bit design that was extended with the use
    > of surrogate pairs.
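
    As a side note, the surrogate-pair extension Doug mentions maps each
    code point beyond U+FFFF onto two 16-bit units. A minimal Python sketch
    of that mapping (function name is mine):

    ```python
    def to_surrogates(cp: int) -> tuple[int, int]:
        """Split a supplementary code point (U+10000..U+10FFFF)
        into a UTF-16 high/low surrogate pair."""
        if not 0x10000 <= cp <= 0x10FFFF:
            raise ValueError("not a supplementary code point")
        v = cp - 0x10000                 # 20 bits to distribute
        high = 0xD800 | (v >> 10)        # high (lead) surrogate
        low = 0xDC00 | (v & 0x3FF)       # low (trail) surrogate
        return high, low
    ```

    This is how the original 16-bit design was stretched to 1,114,112 code
    points without abandoning 16-bit code units.
    
    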


    > The difference between finding 65,000 things that need to be encoded and
    > finding 1.1 million things that need to be encoded is the difference
    > between night and day.

    I can only refer to the development of the Bison parser generator
    <>. There, the number of tokens, states, etc. were often
    limited to 2^15. But it turns out that people want, in view of more powerful
    computers, to plug in larger and larger grammars. One can then plug in really
    large machine-generated grammars. This way, one might plug in grammars with
    millions of tokens, for example. So these lower limits are now being removed.

    So, as long as these Unicode encodings will only be used for characters
    enumerated by humans, 1 million is perhaps well within the boundaries. But
    if somebody comes up with some clever machine enumeration, then it might be
    exceeded.

      Hans Aberg

    This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 12:59:56 CST