Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 12:52:17 CST

    On 2005/01/18 17:34, Jon Hanna at jon@hackcraft.net wrote:

    >> 0x00...0x7F: 0xxxxxxx
    >> 0x80...0x7FF: 110xxxxx 10xxxxxx
    >> 0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
    >> 0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> 0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> 0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> 0x80000000...0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> 0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >
    > Of course this loses the fact that UTF-8 data will never contain 0xFE or 0xFF
    > (and so UTF-16 with a BOM will never be confused with UTF-8, a fact that is
    > important to XML parsers for one application).
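
The quoted table and the objection to it can be made concrete with a minimal Python sketch. The name `encode_extended` and the table `_FORMS` are hypothetical and only illustrate the proposed scheme; this is not standard UTF-8, which stops well short of the 7- and 8-byte forms. The first six rows match the original UTF-8 layout, and the last two use the lead bytes 0xFE and 0xFF with no payload bits of their own.

```python
# Each row: (exclusive upper bound, lead-byte prefix, continuation bytes).
# The last two rows are the hypothetical extension from the quoted table.
_FORMS = [
    (0x80,          0x00, 0),
    (0x800,         0xC0, 1),
    (0x10000,       0xE0, 2),
    (0x200000,      0xF0, 3),
    (0x4000000,     0xF8, 4),
    (0x80000000,    0xFC, 5),
    (0x1000000000,  0xFE, 6),   # 36 payload bits, lead byte is always 0xFE
    (0x40000000000, 0xFF, 7),   # 42 payload bits, lead byte is always 0xFF
]


def encode_extended(value):
    """Encode a non-negative integer using the extended table above."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for limit, prefix, n_cont in _FORMS:
        if value < limit:
            # High bits go into the lead byte, 6 bits into each continuation.
            lead = prefix | (value >> (6 * n_cont))
            tail = [0x80 | ((value >> (6 * i)) & 0x3F)
                    for i in range(n_cont - 1, -1, -1)]
            return bytes([lead] + tail)
    raise ValueError("value does not fit in 42 bits")
```

For values up to 0x10FFFF the output agrees with ordinary UTF-8, while anything from 0x80000000 upward starts with 0xFE or 0xFF, which is exactly the property Jon points out would be lost.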

    In <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, use of a BOM is
    discouraged on UNIX platforms. So if endianness might become a
    problem, it may be better to use UTF-8 externally and convert it to
    UTF-32/H/L internally in the program.
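
A rough sketch of that arrangement in Python, assuming the internal form is simply a list of integer code points (the helper names `utf8_to_code_points` and `code_points_to_utf8` are made up for illustration):

```python
def utf8_to_code_points(data):
    """Decode external UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf8(code_points):
    """Encode internal code points back into UTF-8 for external use."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


# UTF-8 is a byte sequence, so the external form has no byte order.
external = "\u20ac".encode("utf-8")
internal = utf8_to_code_points(external)

# Serialized UTF-32, by contrast, differs between the two byte orders,
# which is why it is kept internal in this sketch.
little = "\u20ac".encode("utf-32-le")
big = "\u20ac".encode("utf-32-be")
```

Only the program's own memory ever sees the 32-bit values, so no BOM is needed on the wire or on disk.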

      Hans Aberg


