Subject: Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Jan 18 2005 - 02:44:46 CST

Next message: Antoine Leca: "Re: 32'nd bit & UTF-8"

Previous message: Doug Ewell: "Re: ISO 15924 update"
Next in thread: Jon Hanna: "RE: Subject: Re: 32'nd bit & UTF-8"
Reply: Jon Hanna: "RE: Subject: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe reply: D. Starner: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Oliver Christ: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Oliver Christ: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Peter Constable: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Peter Constable: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Hans Aberg: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Rick McGowan: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Richard T. Gillam: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Kenneth Whistler: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Richard T. Gillam: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Arcane Jill: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Arcane Jill: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Martin Duerst: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Martin Duerst: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe reply: Martin Duerst: "RE: Subject: Re: 32'nd bit & UTF-8"

It's just that the table is incomplete. If you want to extend it further, the
mechanism is completely obvious.

0x00...0x7F: 0xxxxxxx
0x80...0x7FF: 110xxxxx 10xxxxxx
0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x80000000...0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

See the last couple of lines I added. This allows you to encode numbers all the
way up to 2^42-1.
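
For anyone who wants to play with this, here is a rough Python sketch of the extended table, offered as an illustration rather than a reference implementation. The names encode_extended and decode_extended are made up for this example, and the result is NOT UTF-8 once you go past the four-byte forms. The decoder is deliberately minimal and assumes it is handed exactly one encoded sequence; it does not reject overlong forms.

```python
# Sketch of the extended table above. Hypothetical helper names; not UTF-8.
# Each entry: (lead-byte prefix, payload bits in the lead byte, trail bytes)
_FORMS = [
    (0x00, 7, 0),  # 0xxxxxxx
    (0xC0, 5, 1),  # 110xxxxx
    (0xE0, 4, 2),  # 1110xxxx
    (0xF0, 3, 3),  # 11110xxx
    (0xF8, 2, 4),  # 111110xx
    (0xFC, 1, 5),  # 1111110x
    (0xFE, 0, 6),  # 11111110 (added line)
    (0xFF, 0, 7),  # 11111111 (added line)
]


def encode_extended(n: int) -> bytes:
    """Encode a non-negative integer below 2**42 using the shortest form."""
    if n < 0:
        raise ValueError("value must be non-negative")
    for prefix, lead_bits, trail in _FORMS:
        if n < (1 << (lead_bits + 6 * trail)):
            out = bytearray(trail + 1)
            # Fill trail bytes from the right, six payload bits at a time.
            for i in range(trail, 0, -1):
                out[i] = 0x80 | (n & 0x3F)
                n >>= 6
            out[0] = prefix | n  # whatever is left fits in the lead byte
            return bytes(out)
    raise ValueError("value exceeds 2**42 - 1")


def decode_extended(data: bytes) -> int:
    """Decode exactly one sequence produced by encode_extended."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    # Count the leading 1 bits of the lead byte (0 to 8).
    ones = 0
    while ones < 8 and lead & (0x80 >> ones):
        ones += 1
    if ones == 1:
        raise ValueError("sequence starts with a trail byte")
    trail = max(ones - 1, 0)
    lead_bits = max(7 - ones, 0)
    if len(data) != trail + 1:
        raise ValueError("wrong number of trail bytes")
    n = lead & ((1 << lead_bits) - 1)
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("malformed trail byte")
        n = (n << 6) | (b & 0x3F)
    return n
```

For values up to 0x10FFFF (surrogates aside) the output agrees with ordinary UTF-8, so encode_extended(0x20AC) gives the same three bytes as the euro sign in UTF-8. Above that, 0x80000000 comes out as FE 82 80 80 80 80 80, and the largest value, 2^42-1, is FF followed by seven BF bytes.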

It is important NOT to call this encoding UTF-8, however. By all means use it,
but call it something else. Also, observe that I have not dared to write mad
stuff like "U-04000000 - U-7FFFFFFF:" (which I found in the original table). I
used the "0x" prefix, not the "U+" (or "U-") prefix. Of course, it's perfectly
obvious to us here that "0x" indicates "an integer, expressed in hexadecimal",
whereas "U+" means "a Unicode character", so "U+04000000" makes no sense. But
the table you quoted used "U-" not "U+". I don't know what that means. (Perhaps
someone could tell me?)

Anyway, the "limit" you quoted exists purely because they didn't add the extra
lines. (You could even extend the mechanism to infinity if you wanted, by
allowing the lead byte 0xFF to mean "and an unlimited number of trail bytes".)

Jill
-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Hans Aberg
Sent: 18 January 2005 00:19
To: Kenneth Whistler
Cc: unicode@unicode.org; kenw@sybase.com
Subject: Re: 32'nd bit & UTF-8

Sorry, I have been somewhat sloppy in my terminology. I am well aware of
that strictly speaking UTF-8 and UTF-32 are defined only for those 21-bit
values. But in <http://www.cl.cam.ac.uk/~mgk25/unicode.html> there is an
extension handling 32-bit numbers. It seems strange that this extension
excludes the full 32-bits.

This archive was generated by hypermail 2.1.5 : Tue Jan 18 2005 - 02:47:18 CST