From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Jan 18 2005 - 02:44:46 CST
It's just that the table is incomplete. If you want to extend it further, the
mechanism is completely obvious.
0x00...0x7F: 0xxxxxxx
0x80...0x7FF: 110xxxxx 10xxxxxx
0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x80000000...0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
See the last couple of lines I added. This allows you to encode numbers all the
way up to 2^42-1.
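
To make the mechanism concrete, here is a minimal sketch of the table above in Python. The names encode_extended and decode_extended are made up for illustration, and, as noted, this is not UTF-8: it simply follows the bit patterns in the table, one row per tuple. It makes no attempt to reject overlong forms or surrogate values.

```python
# Hypothetical sketch of the extended scheme in the table above (NOT UTF-8).
# Each row: (largest value for this form, fixed lead-byte bits, trail bytes).
_FORMS = [
    (0x7F,          0x00, 0),
    (0x7FF,         0xC0, 1),
    (0xFFFF,        0xE0, 2),
    (0x1FFFFF,      0xF0, 3),
    (0x3FFFFFF,     0xF8, 4),
    (0x7FFFFFFF,    0xFC, 5),
    (0xFFFFFFFFF,   0xFE, 6),
    (0x3FFFFFFFFFF, 0xFF, 7),
]


def encode_extended(n: int) -> bytes:
    """Encode an integer in 0 .. 2**42 - 1 using the shortest form that fits."""
    if n < 0:
        raise ValueError("value must be non-negative")
    for limit, lead, trail in _FORMS:
        if n <= limit:
            # The lead byte carries whatever bits remain above the trail bytes.
            out = bytearray([lead | (n >> (6 * trail))])
            # Each trail byte carries six payload bits, most significant first.
            for i in range(trail - 1, -1, -1):
                out.append(0x80 | ((n >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("value does not fit in 42 bits")


def decode_extended(data: bytes) -> int:
    """Decode exactly one encoded value; overlong forms are not rejected."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    # Count the leading 1 bits of the lead byte (0, or 2 through 8).
    ones = 0
    while ones < 8 and first & (0x80 >> ones):
        ones += 1
    if ones == 1:
        raise ValueError("trail byte found where a lead byte was expected")
    trail = 0 if ones == 0 else ones - 1
    if len(data) != trail + 1:
        raise ValueError("wrong number of bytes for this lead byte")
    # Payload bits of the lead byte sit below the run of 1s and the 0 marker.
    value = first & (0xFF >> (ones + 1))
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("malformed trail byte")
        value = (value << 6) | (b & 0x3F)
    return value
```

For values up to 0x1FFFFF the output has the same byte layout as UTF-8 (for example, 0x20AC comes out as E2 82 AC); the last two rows are where the lead bytes 0xFE and 0xFF come into play.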
It is important NOT to call this encoding UTF-8, however. By all means use it,
but call it something else. Also, observe that I have not dared to write mad
stuff like "U-04000000 - U-7FFFFFFF:" (which I found in the original table). I
used the "0x" prefix, not the "U+" (or "U-"?) prefix. Of course, it's perfectly
obvious to us here that "0x" indicates "an integer, expressed in hexadecimal",
whereas "U+" means "a Unicode character", so "U+04000000" makes no sense. But
the table you quoted used "U-" not "U+". I don't know what that means. (Perhaps
someone could tell me?)
Anyway, the "limit" you quoted was purely because they didn't add the extra
lines. (You could even extend the mechanism to infinity if you wanted, by
allowing the lead byte 0xFF to mean "and an unlimited number of trail bytes".)
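
One way the "unlimited" variant might look, purely as a sketch: the lead byte 0xFF is followed by however many 10xxxxxx trail bytes the value needs, and a decoder keeps reading until it meets a byte that is not a trail byte. The function names are invented for illustration. Note that this gives 0xFF a different meaning from the eight-byte row in the table, where it is followed by exactly seven trail bytes, so the two readings cannot be mixed in one stream.

```python
# Hypothetical sketch of the open-ended variant: 0xFF, then any number of
# 10xxxxxx trail bytes. This is an assumption about how such an extension
# could work, not part of any standard.

def encode_unbounded(n: int) -> bytes:
    """Encode a non-negative integer as 0xFF plus as many trail bytes as needed."""
    if n < 0:
        raise ValueError("value must be non-negative")
    groups = []
    while True:
        # Peel off six bits at a time, least significant group first.
        groups.append(0x80 | (n & 0x3F))
        n >>= 6
        if n == 0:
            break
    return bytes([0xFF]) + bytes(reversed(groups))


def decode_unbounded(data: bytes, pos: int = 0) -> tuple:
    """Decode one value starting at pos; return (value, next position)."""
    if pos >= len(data) or data[pos] != 0xFF:
        raise ValueError("expected lead byte 0xFF")
    pos += 1
    start = pos
    value = 0
    # Consume trail bytes until the data ends or a non-trail byte appears.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        value = (value << 6) | (data[pos] & 0x3F)
        pos += 1
    if pos == start:
        raise ValueError("lead byte 0xFF with no trail bytes")
    return value, pos
```

Because trail bytes are the only bytes that begin with the bits 10, the end of the sequence is self-delimiting and no length field is needed.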
Jill
-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Hans Aberg
Sent: 18 January 2005 00:19
To: Kenneth Whistler
Cc: unicode@unicode.org; kenw@sybase.com
Subject: Re: 32'nd bit & UTF-8
Sorry, I have been somewhat sloppy in my terminology. I am well aware that,
strictly speaking, UTF-8 and UTF-32 are defined only for those 21-bit
values. But in <http://www.cl.cam.ac.uk/~mgk25/unicode.html> there is an
extension handling 32-bit numbers. It seems strange that this extension
does not cover the full 32 bits.