Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 12:52:16 CST

    On 2005/01/18 09:44, Arcane Jill at arcanejill@ramonsky.com wrote:

    > It's just that the table is incomplete. If you want to extend it further, the
    > mechanism is completely obvious.
    >
    > 0x00...0x7F: 0xxxxxxx
    > 0x80...0x7FF: 110xxxxx 10xxxxxx
    > 0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
    > 0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    > 0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > 0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > 0x80000000...0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > 0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >
    > See the last couple of lines I added. This allows you to encode numbers all
    > the way up to 2^42-1.

    The problem with this suggestion is that it is of less use when trying to
    describe 32-bit numbers. 32 bits is the natural computer alignment, so that
    is what one is stuck with.
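
    To make the 32-bit case concrete, here is a minimal sketch of an encoder
    for the extended table quoted above. The function name encode_extended is
    hypothetical, and the code simply assumes the bit layout from that table,
    including the two added rows; it is not UTF-8 and is not taken from any
    standard.

```python
def encode_extended(n: int) -> bytes:
    """Encode 0 <= n < 2**42 with the extended table quoted above.

    Hypothetical helper: it assumes the lead-byte layout from the table,
    including the 0xFE and 0xFF lead bytes, which are not valid UTF-8.
    """
    if not 0 <= n < (1 << 42):
        raise ValueError("value out of range for this scheme")
    if n < 0x80:
        return bytes([n])  # single byte: 0xxxxxxx

    # Pick the smallest number of trail bytes whose payload can hold n.
    for trail in range(1, 8):
        lead_bits = max(6 - trail, 0)  # payload bits left in the lead byte
        if n < (1 << (lead_bits + 6 * trail)):
            break

    # Lead byte: (trail + 1) one bits, then a zero bit if there is room.
    lead_marker = (0xFF << (7 - trail)) & 0xFF
    lead = lead_marker | (n >> (6 * trail))

    # Trail bytes: 10xxxxxx, most significant 6-bit group first.
    tail = [0x80 | ((n >> (6 * i)) & 0x3F) for i in range(trail - 1, -1, -1)]
    return bytes([lead] + tail)


if __name__ == "__main__":
    for value in (0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
        print(hex(value), encode_extended(value).hex(" "))
```

    Under these assumptions every value from 0x80000000 to 0xFFFFFFFF takes
    seven bytes (lead byte 0xFE plus six trail bytes), and the lead byte
    carries no payload bits at all.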

    > It is important NOT to call this encoding UTF-8, however. By all means use it,
    > but call it something else. Also, observe that I have not dared to write mad
    > stuff like "U-04000000 - U-7FFFFFFF:" (which I found in the original table). I
    > used the "0x" prefix, not the "U+" (or "U-"?) prefix. Of course, it's perfectly
    > obvious to us here that "0x" indicates "an integer, expressed in hexadecimal",
    > whereas "U+" means "a Unicode character", so "U+04000000" makes no sense. But
    > the table you quoted used "U-" not "U+". I don't know what that means.
    > (Perhaps someone could tell me?)

    Philippe VERDY has already pointed out that the 31-bit version is called
    UTF-BSS. UTF-8 is restricted to values at or below 0x10FFFF.
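
    As a small illustration of that limit, the following sketch returns the
    UTF-8 bytes for a value only when it lies in the range UTF-8 permits. The
    helper name utf8_or_none is hypothetical; the surrogate exclusion is an
    extra detail of UTF-8 proper that is not mentioned above.

```python
from typing import Optional

UNICODE_MAX = 0x10FFFF  # highest value UTF-8 may encode


def utf8_or_none(n: int) -> Optional[bytes]:
    """Return the UTF-8 encoding of n, or None if UTF-8 cannot encode it."""
    if not 0 <= n <= UNICODE_MAX:
        return None  # beyond the UTF-8 range (or negative)
    if 0xD800 <= n <= 0xDFFF:
        return None  # surrogates are excluded from UTF-8 as well
    return chr(n).encode("utf-8")


if __name__ == "__main__":
    print(utf8_or_none(0x10FFFF))  # four bytes, lead byte 0xF4
    print(utf8_or_none(0x110000))  # None: just above the limit
```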

    > Anyway, the "limit" you quoted was purely because they didn't add the extra
    > lines. (You could even extend the mechanism to infinity if you wanted, by
    > allowing the lead byte 0xFF to mean "and an unlimited number of trail bytes")

    The question came up purely because the lexer generator needs to handle, in
    "UTF-BSS-32" format, all 2^32 numbers when proper lexer diagnostics are
    taken into account. Therefore I thought it should be possible to do the
    same with "UTF-BSS-8". Since there is a natural 32-bit alignment used in
    computers today, no higher numbers are needed.

    If one should philosophize on the question of general multi-byte encodings
    (or rather "transformation formats"), then UTF-BSS uses a leading byte in
    which the number of bytes is given in unary number format, that is, as a
    number of base 1. In a computer it is in fact more efficient to use binary
    numbers :-), so I would probably put a binary number there instead. One
    could still use the unary number idea to indicate the length of the binary
    number.
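
    The unary length prefix can be read off by counting the leading 1 bits of
    the lead byte. A minimal sketch, assuming the extended table quoted
    earlier (so 0xFE and 0xFF count as lead bytes); the function name
    sequence_length is hypothetical:

```python
def sequence_length(lead: int) -> int:
    """Total byte count of a sequence, read from the run of leading 1 bits.

    Hypothetical helper for the extended scheme quoted earlier, in which
    0xFE and 0xFF are also lead bytes.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")

    # Count the unary prefix: consecutive 1 bits from the top of the byte.
    ones = 0
    mask = 0x80
    while mask and (lead & mask):
        ones += 1
        mask >>= 1

    if ones == 0:
        return 1  # 0xxxxxxx stands alone
    if ones == 1:
        raise ValueError("10xxxxxx is a trail byte, not a lead byte")
    return ones  # n leading ones: lead byte plus n - 1 trail bytes


if __name__ == "__main__":
    for lead in (0x41, 0xC3, 0xF4, 0xFE, 0xFF):
        print(hex(lead), sequence_length(lead))
```

    The count grows by one for each extra 1 bit, which is why the lead byte
    runs out of room after eight bytes unless, as suggested in the quoted
    text, 0xFF is given a special meaning.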

      Hans Aberg


