From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 12:52:17 CST
On 2005/01/18 17:34, Jon Hanna at jon@hackcraft.net wrote:
>> 0x00...0x7F: 0xxxxxxx
>> 0x80...0x7FF: 110xxxxx 10xxxxxx
>> 0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
>> 0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>> 0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>> 0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>> 0x80000000...0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>> 0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> Of course this loses the fact that UTF-8 data will never contain 0xFE or 0xFF
> (and so UTF-16 with a BOM will never be confused with UTF-8, a fact that is
> important to XML parsers for one application).
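
To make the quoted point concrete, here is a rough Python sketch of an encoder for the extended table quoted above. The name encode_extended and the LIMITS list are made up for illustration; the only assumption is that each row works exactly as its bit pattern shows (a run of one-bits in the lead byte, then 6 payload bits per continuation byte). It is not any standard's encoder, only a way to see where 0xFE and 0xFF turn up.

```python
# Upper bound of each row in the quoted table; index + 1 is the sequence length.
# Assumption: these are read straight off the quoted table, nothing more.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF,
          0x7FFFFFFF, 0xFFFFFFFFF, 0x3FFFFFFFFFF]


def encode_extended(cp):
    """Encode an integer with the extended scheme (hypothetical helper)."""
    if cp < 0 or cp > LIMITS[-1]:
        raise ValueError("value out of range for the quoted table")
    if cp <= 0x7F:
        return bytes([cp])  # single byte: 0xxxxxxx
    # Pick the first row whose upper bound fits the value.
    length = next(n + 1 for n, limit in enumerate(LIMITS) if cp <= limit)
    # Lead byte starts with `length` one-bits: 0xC0, 0xE0, ..., 0xFE, 0xFF.
    lead_mask = (0xFF << (8 - length)) & 0xFF
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    # Whatever bits remain (if any) go into the lead byte.
    return bytes([lead_mask | cp] + tail[::-1])
```

With this sketch, values up to 0x7FFFFFFF get lead bytes no higher than 0xFD, so the output stays free of 0xFE and 0xFF. The 7-byte row always starts with 0xFE and the 8-byte row with 0xFF, which is the property the quoted remark says would be lost.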
In <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, the use of a BOM is
discouraged on UNIX platforms. So if endianness becomes a problem, it
might be better to use UTF-8 externally and then convert it to
UTF-32/H/L internally in the program.
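
As a minimal sketch of that arrangement in Python: bytes on disk or on the wire stay UTF-8, which has no byte-order question, and the program converts to 32-bit code units in whatever byte order the host uses. The function names are invented for this example, and choosing the byte order from sys.byteorder is my assumption about what "internally" would mean here.

```python
import sys


def code_points(data):
    """Decode external UTF-8 bytes into a list of integer code points."""
    # bytes.decode raises UnicodeDecodeError on malformed UTF-8.
    return [ord(ch) for ch in data.decode("utf-8")]


def utf8_to_host_utf32(data):
    """Convert external UTF-8 bytes to UTF-32 in the host's byte order."""
    text = data.decode("utf-8")
    # The explicit -le / -be codecs do not write a BOM; the byte order is
    # already known inside the program, so none is needed.
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return text.encode(codec)
```

For example, code_points(b"a\xe2\x82\xac") gives [0x61, 0x20AC], and utf8_to_host_utf32 on the same input gives 8 bytes, with no BOM in either the input or the output.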
Hans Aberg