Subject: Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Jan 18 2005 - 02:44:46 CST

    It's just that the table is incomplete. If you want to extend it further, the
    mechanism is completely obvious.

    0x00...0x7F:                  0xxxxxxx
    0x80...0x7FF:                 110xxxxx 10xxxxxx
    0x800...0xFFFF:               1110xxxx 10xxxxxx 10xxxxxx
    0x10000...0x1FFFFF:           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x200000...0x3FFFFFF:         111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x4000000...0x7FFFFFFF:       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x80000000...0xFFFFFFFFF:     11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    See the last couple of lines I added. This allows you to encode numbers all the
    way up to 2^42-1.
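
    For anyone who wants to try the table out, here is a minimal Python
    sketch of it. The names encode_extended, decode_extended and _FORMS are
    invented for illustration and belong to no standard or library. It
    assumes the shortest form that fits is always used, and the decoder
    handles exactly one sequence and does not reject overlong forms.

```python
# Sketch of the extended table above. This is NOT UTF-8.
# Each entry: (lead-byte marker, payload bits in the lead byte, trail bytes).
_FORMS = [
    (0x00, 7, 0),  # 0xxxxxxx
    (0xC0, 5, 1),  # 110xxxxx
    (0xE0, 4, 2),  # 1110xxxx
    (0xF0, 3, 3),  # 11110xxx
    (0xF8, 2, 4),  # 111110xx
    (0xFC, 1, 5),  # 1111110x
    (0xFE, 0, 6),  # 11111110
    (0xFF, 0, 7),  # 11111111
]


def encode_extended(n: int) -> bytes:
    """Encode a non-negative integer below 2**42 using the table's forms."""
    if n < 0:
        raise ValueError("value must be non-negative")
    for marker, lead_bits, trail in _FORMS:
        # A form carries lead_bits in the lead byte plus 6 bits per trail byte.
        if n < (1 << (lead_bits + 6 * trail)):
            out = [marker | (n >> (6 * trail))]
            for i in range(trail - 1, -1, -1):
                out.append(0x80 | ((n >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("value needs more than 42 bits")


def decode_extended(data: bytes) -> int:
    """Decode exactly one sequence produced by encode_extended."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    for marker, lead_bits, trail in _FORMS:
        # Compare only the marker bits, ignoring the payload bits.
        if (lead >> lead_bits) == (marker >> lead_bits):
            if len(data) != trail + 1:
                raise ValueError("wrong number of trail bytes")
            n = lead & ((1 << lead_bits) - 1)
            for b in data[1:]:
                if (b & 0xC0) != 0x80:
                    raise ValueError("expected a 10xxxxxx trail byte")
                n = (n << 6) | (b & 0x3F)
            return n
    raise ValueError("not a lead byte")
```

    For values below 0x200000 the first four forms produce the same bit
    layout as the first four rows of the table, so encode_extended(0x20AC)
    gives the bytes E2 82 AC. The last two forms are the added rows, and
    0xFFFFFFFF comes out as FE 83 BF BF BF BF BF.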

    It is important NOT to call this encoding UTF-8, however. By all means use it,
    but call it something else. Also, observe that I have not dared to write mad
    stuff like "U-04000000 - U-7FFFFFFF:" (which I found in the original table). I
    used the "0x" prefix, not the "U+" (or "U-"?) prefix. Of course, it's perfectly
    obvious to us here that "0x" indicates "an integer, expressed in hexadecimal",
    whereas "U+" means "a Unicode character", so "U+04000000" makes no sense. But
    the table you quoted used "U-" not "U+". I don't know what that means. (Perhaps
    someone could tell me?)

    Anyway, the "limit" you quoted was purely because they didn't add the extra
    lines. (You could even extend the mechanism to infinity if you wanted, by
    allowing the lead byte 0xFF to mean "and an unlimited number of trail bytes".)
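
    One possible reading of that open-ended variant, sketched in Python.
    This is an assumption on top of the remark, not something the table
    defines: here 0xFF no longer means "exactly seven trail bytes" but
    "read 10xxxxxx bytes until one stops matching". The function names
    are hypothetical.

```python
def encode_open_ended(n: int) -> bytes:
    """0xFF followed by as many 6-bit trail bytes as n needs (at least one)."""
    if n < 0:
        raise ValueError("value must be non-negative")
    groups = []
    while True:
        groups.append(0x80 | (n & 0x3F))  # low 6 bits first
        n >>= 6
        if n == 0:
            break
    # Trail bytes are emitted most significant group first.
    return bytes([0xFF] + groups[::-1])


def decode_open_ended(data: bytes):
    """Return (value, bytes consumed); stops at the first non-trail byte."""
    if not data or data[0] != 0xFF:
        raise ValueError("expected lead byte 0xFF")
    n = 0
    i = 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        n = (n << 6) | (data[i] & 0x3F)
        i += 1
    if i == 1:
        raise ValueError("lead byte 0xFF without trail bytes")
    return n, i
```

    The cost of this design is that the lead byte no longer tells you the
    sequence length; a decoder has to scan forward to the next byte that
    is not of the form 10xxxxxx.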

    Jill

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Hans Aberg
    Sent: 18 January 2005 00:19
    To: Kenneth Whistler
    Cc: unicode@unicode.org; kenw@sybase.com
    Subject: Re: 32'nd bit & UTF-8

    Sorry, I have been somewhat sloppy in my terminology. I am well aware
    that, strictly speaking, UTF-8 and UTF-32 are defined only for those 21-bit
    values. But in <http://www.cl.cam.ac.uk/~mgk25/unicode.html> there is an
    extension handling 32-bit numbers. It seems strange that this extension
    excludes the full 32 bits.


