Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 12:52:16 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: 32'nd bit & UTF-8"
In reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Jon Hanna: "RE: 32'nd bit & UTF-8"
Reply: Jon Hanna: "RE: 32'nd bit & UTF-8"

On 2005/01/18 09:44, Arcane Jill at arcanejill@ramonsky.com wrote:

> It's just that the table is incomplete. If you want to extend it further, the
> mechanism is completely obvious.
>
> 0x00...0x7F: 0xxxxxxx
> 0x80...0x7FF: 110xxxxx 10xxxxxx
> 0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
> 0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x80000000 - 0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x1000000000 - 0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> See the last couple of lines I added. This allows you to encode numbers all the
> way up to 2^42-1.

The problem with this suggestion is that it is of less use when trying to
describe 32-bit numbers. 32 bits is the natural computer alignment, so that
is what one is stuck with.
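
The quoted table can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function name `encode_extended` is
hypothetical, the rows are taken from the quoted table (including the two
added rows with lead bytes 0xFE and 0xFF), and no check is made for
surrogates or the 0x10FFFF limit, so the output is not UTF-8 proper.

```python
def encode_extended(n: int) -> bytes:
    """Encode n per the quoted table (hypothetical helper, not UTF-8 proper)."""
    if n < 0 or n >= 1 << 42:
        raise ValueError("value does not fit the 8-byte table")
    if n < 0x80:
        return bytes([n])  # 0xxxxxxx

    # Find the shortest row of the table whose payload can hold n.
    for length in range(2, 9):
        lead_bits = max(0, 7 - length)          # payload bits in the lead byte
        total_bits = lead_bits + 6 * (length - 1)
        if n < 1 << total_bits:
            break

    # The lead byte starts with `length` one bits (a unary byte count).
    lead_prefix = (0xFF << (8 - length)) & 0xFF

    # Peel off 6 bits at a time for the trail bytes, lowest bits first.
    trail = []
    for _ in range(length - 1):
        trail.append(0x80 | (n & 0x3F))
        n >>= 6

    return bytes([lead_prefix | n] + trail[::-1])


if __name__ == "__main__":
    for value in (0x41, 0x20AC, 0x7FFFFFFF, 0xFFFFFFFF):
        print(hex(value), encode_extended(value).hex())
```

Under this sketch every 32-bit value above 0x7FFFFFFF lands in the 7-byte row,
so the 8-byte row is never needed if one stops at 2^32-1.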

> It is important NOT to call this encoding UTF-8, however. By all means use it,
> but call it something else. Also, observe that I have not dared to write mad
> stuff like "U-04000000 - U-7FFFFFFF:" (which I found in the original table). I
> used the "0x" prefix, not the "U+" (or "U-")? prefix. Of course, it's perfectly
> obvious to us here that "0x" indicates "an integer, expressed in hexadecimal",
> whereas "U+" means "a Unicode character", so "U+04000000" makes no sense. But
> the table you quoted used "U-" not "U+". I don't know what that means. (Perhaps
> someone could tell me?)

Philippe VERDY has already pointed out that the 31-bit version is called
UTF-BSS. UTF-8 is restricted to at or below 0x10FFFF.

> Anyway, the "limit" you quoted was purely because they didn't add the extra
> lines. (You could even extend the mechanism to infinity if you wanted, by
> allowing the lead byte 0xFF to mean "and an unlimited number of trail bytes")

The question came up purely because the lexer generator needs to handle, in
"UTF-BSS-32" format, all 2^32 numbers when proper lexer diagnostics are
taken into account. Therefore I thought it would make sense to be able to do
the same with "UTF-BSS-8". Since there is a natural 32-bit alignment used in
computers today, no higher numbers are needed.

If one should philosophize on the question of general multi-byte encodings
(or rather "transformation formats"), then UTF-BSS uses the leading byte to
give the number of bytes in a unary number format, that is, as a base-1
number. In fact, in a computer, it is more efficient to use binary numbers
:-), so I would probably put a binary number there instead. One could still
use the unary number idea to indicate the length of the binary number.
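
To make the unary idea concrete, here is a minimal Python sketch that reads
the byte count off a lead byte by counting its leading one bits. The name
`sequence_length` is hypothetical, and treating 0xFE and 0xFF as 7-byte and
8-byte lead bytes is an assumption that follows the extended table quoted
earlier rather than any standard.

```python
def sequence_length(lead: int) -> int:
    """Return the byte count announced by a lead byte (hypothetical helper)."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte")

    # Count the run of one bits starting at the most significant bit.
    ones = 0
    mask = 0x80
    while mask and lead & mask:
        ones += 1
        mask >>= 1

    if ones == 0:
        return 1  # 0xxxxxxx: a single-byte sequence
    if ones == 1:
        raise ValueError("10xxxxxx is a trail byte, not a lead byte")
    return ones   # 110xxxxx -> 2, 1110xxxx -> 3, and so on


if __name__ == "__main__":
    for lead in (0x41, 0xC3, 0xE2, 0xF4, 0xFE, 0xFF):
        print(hex(lead), sequence_length(lead))
```

The loop is the cost of the unary format: the decoder has to scan bits one at
a time, whereas a binary length field could be read with a single mask.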

Hans Aberg

This archive was generated by hypermail 2.1.5 : Tue Jan 18 2005 - 12:54:44 CST