From: Hans Aberg (firstname.lastname@example.org)
Date: Tue Jan 18 2005 - 12:52:16 CST
On 2005/01/18 09:44, Arcane Jill at email@example.com wrote:
> It's just that the table is incomplete. If you want to extend it further, the
> mechanism is completely obvious.
> 0x00...0x7F: 0xxxxxxx
> 0x80...0x7FF: 110xxxxx 10xxxxxx
> 0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
> 0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 0x80000000 - 0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 10xxxxxx 10xxxxxx
> 0x1000000000 - 0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 10xxxxxx 10xxxxxx 10xxxxxx
> See the last couple of lines I added. This allows you to encode numbers all
> the way up to 2^42-1.
The problem with this suggestion is that it is of less use when trying to
describe 32-bit numbers. 32 bits is the natural computer alignment, so that
is what one is stuck with.
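
To make the 32-bit concern concrete, here is a minimal Python sketch of an encoder for the extended table quoted above. The names PAYLOAD_BITS, encoded_length and encode_extended are made up for illustration, and the sketch assumes the 0xFE row carries six trail bytes (36 payload bits) and the 0xFF row seven (42 bits), as the quoted ranges imply. It is not UTF-8 and not taken from any standard.

```python
# Payload bits available for each total sequence length, following the
# extended table (assumption: 0xFE lead -> 6 trail bytes, 0xFF lead -> 7).
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31, 7: 36, 8: 42}


def encoded_length(n: int) -> int:
    """Return the number of bytes the extended scheme needs for n."""
    if n < 0:
        raise ValueError("negative numbers cannot be encoded")
    for length, bits in PAYLOAD_BITS.items():
        if n < (1 << bits):
            return length
    raise ValueError("number does not fit in 42 bits")


def encode_extended(n: int) -> bytes:
    """Encode n using the extended (non-UTF-8) multi-byte table."""
    length = encoded_length(n)
    if length == 1:
        return bytes([n])
    trail = []
    for _ in range(length - 1):
        trail.append(0x80 | (n & 0x3F))  # 10xxxxxx, lowest six bits first
        n >>= 6
    # Lead byte: `length` one bits (all eight for length 8), then whatever
    # payload bits are left over in n.
    lead_prefix = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_prefix | n] + trail[::-1])


print(encode_extended(0x20AC).hex())      # e282ac, same bytes as UTF-8
print(encode_extended(0xFFFFFFFF).hex())  # fe83bfbfbfbf: wait for 7 bytes below
print(encoded_length(0xFFFFFFFF))         # 7
```

Under these assumptions the largest 32-bit number lands in the 7-byte row and uses only 32 of its 36 payload bits; the 8-byte row is never needed for 32-bit input.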
> It is important NOT to call this encoding UTF-8, however. By all means use it,
> but call it something else. Also, observe that I have not dared to write mad
> stuff like "U-04000000 - U-7FFFFFFF:" (which I found in the original table). I
> used the "0x" prefix, not the "U+" (or "U-")? prefix. Of course, it's
> obvious to us here that "0x" indicates "an integer, expressed in hexadecimal",
> whereas "U+" means "a Unicode character", so "U+04000000" makes no sense. But
> the table you quoted used "U-" not "U+". I don't know what that means. (Perhaps
> someone could tell me?)
Philippe VERDY has already pointed out that the 31-bit version is called
UTF-BSS. UTF-8 is restricted to at or below 0x10FFFF.
> Anyway, the "limit" you quoted was purely because they didn't add the extra
> lines. (You could even extend the mechanism to infinity if you wanted, by
> allowing the lead byte 0xFF to mean "and an unlimited number of trail bytes")
The question came up purely because the lexer generator needs to handle, in
"UTF-BSS-32" format, all 2^32 numbers once proper lexer diagnostics are
taken into account. Therefore I thought it would be good to be able to do the
same with "UTF-BSS-8". Since there is a natural 32-bit alignment used in
computers today, no higher numbers are needed.
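
As a rough illustration of handling every 32-bit number in the 8-bit form, the following Python sketch decodes one sequence of the extended table quoted earlier. The function name decode_extended is hypothetical, the sketch assumes the 0xFE lead byte is followed by six trail bytes and 0xFF by seven, and it deliberately does not reject overlong forms, which a real lexer would probably want to diagnose.

```python
def decode_extended(data: bytes) -> int:
    """Decode exactly one sequence of the extended (non-UTF-8) table."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]

    # Count the leading one bits of the lead byte (the unary length field).
    ones = 0
    while ones < 8 and lead & (0x80 >> ones):
        ones += 1
    if ones == 1:
        raise ValueError("sequence starts with a trail byte")

    length = 1 if ones == 0 else ones
    if len(data) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(data)))

    # Payload bits of the lead byte are whatever follows the unary field;
    # for the 0xFE and 0xFF lead bytes there are none.
    value = lead & (0x7F >> ones)
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("invalid trail byte 0x%02X" % byte)
        value = (value << 6) | (byte & 0x3F)
    return value


# The largest 32-bit number, written as a 7-byte sequence.
print(hex(decode_extended(b"\xfe\x83\xbf\xbf\xbf\xbf\xbf")))
```

The same loop serves every row of the table because the mask 0x7F >> ones shrinks to zero once the lead byte has no payload bits left.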
If one should philosophize on the question of general multi-byte encodings
(or rather "transformation formats"), then UTF-BSS uses the leading byte to
give the number of bytes in a unary number format, that is, as a number in
base 1. In fact, in a computer, it is more efficient to use binary numbers :-),
so I would probably put a binary number there instead. One could still use
the unary number idea in order to indicate the length of the binary number.
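
One possible reading of this idea, sketched in Python: write the byte count in binary, and put a unary field in front that says how many bits the binary count occupies. This is similar in spirit to Elias gamma coding. The function names unary_header and binary_header and the exact bit layout are assumptions made for illustration only; the post does not specify a layout.

```python
def unary_header(count: int) -> str:
    """Length field as in the table: `count` one bits, then a zero."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return "1" * count + "0"


def binary_header(count: int) -> str:
    """Hypothetical alternative: unary size of the binary field, then the count.

    The count is written in binary without its leading one bit (which is
    always 1); a unary prefix tells the reader how many bits follow.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    binary = format(count, "b")
    return "1" * (len(binary) - 1) + "0" + binary[1:]


# Compare the header sizes in bits for a few byte counts.
for count in (1, 2, 7, 8, 64):
    print(count, len(unary_header(count)), len(binary_header(count)))
```

The unary header grows linearly with the byte count, while this binary variant grows logarithmically, which only starts to matter for sequences much longer than those in the table.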