From: Doug Ewell (email@example.com)
Date: Mon Jan 17 2005 - 12:46:52 CST
Hans Aberg <haberg at math dot su dot se> wrote:
> Are there any good reasons for UTF-32 to exclude the 32'nd bit of an
> encoded 4-byte? I.e, the 6-byte combinations
> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> where the first x = 1.
(I think you mean UTF-8 rather than UTF-32.)
This is not a limit of UTF-8 per se, but of the underlying ISO/IEC 10646
encoding. It was intentionally designed as a 31-bit value (bit 31 = 0)
to prevent implementation problems in "signed integer" environments.
Java, for example, has no unsigned 32-bit integer type.
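
As a rough illustration of the signed-integer concern, here is a small Python sketch. The helper name as_signed32 is made up for this example; it simply reinterprets a 32-bit pattern as a two's-complement signed value, which is what a language without an unsigned 32-bit type would see.

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit pattern as a two's-complement signed integer."""
    raw = value.to_bytes(4, "big")                  # the 32-bit pattern as 4 bytes
    return int.from_bytes(raw, "big", signed=True)  # read the same bytes as signed

print(as_signed32(0x7FFFFFFF))  # largest 31-bit value, stays non-negative
print(as_signed32(0x80000000))  # bit 31 set, comes out negative
```

Any value that fits in 31 bits keeps its sign under this reinterpretation, which is the property the 31-bit design preserves.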
> With a full 32-bit encoding, one can also use UTF-8 to encode binary
> data. It also simplifies somewhat the implementation of Unicode in
> lexer generators (such as Flex): The leading byte then covers all 256
> combinations. All 2^32 numbers should probably be there for generating
> proper lexer error messages.
This is not what UTF-8 is for. It is a multi-byte encoding scheme
intended to cover the entire ISO 10646 and Unicode space while remaining
ASCII-compatible. Formats intended for arbitrary binary data have
neither the same requirements nor the same constraints. In fact, for
randomly occurring binary data, UTF-8 will use 5 or 6 bytes to represent
a 32-bit value 99.9% of the time, with none of the obvious benefits
(like control-code transparency) for which special encoding formats for
binary data are usually employed.
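
The percentage above can be checked with a little arithmetic. The sketch below assumes the original (pre-restriction) UTF-8 length boundaries and, as in the proposal, lets a 6-byte sequence cover everything up to 2^32 - 1. The names LIMITS and extended_utf8_length are mine, not from any library.

```python
# Exclusive upper bounds for 1- to 5-byte sequences in the original UTF-8
# scheme. Anything at or above 2^26 is assumed to take 6 bytes, as in the
# proposed 32-bit extension.
LIMITS = [1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26]

def extended_utf8_length(value):
    """Number of bytes needed for `value` under the assumed extended scheme."""
    for length, limit in enumerate(LIMITS, start=1):
        if value < limit:
            return length
    return 6

# Share of all 32-bit values that need 5 or 6 bytes: everything at or above 2^21.
total = 1 << 32
fraction_long = (total - (1 << 21)) / total
print(f"{fraction_long:.4%}")
```

Only 2^21 of the 2^32 possible values fit in 4 bytes or fewer, so the share needing 5 or 6 bytes works out to 1 - 2^-11, or roughly 99.95%.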
Allowing the lead byte in a UTF-8 sequence to extend to 0xFF may
actually complicate the implementation of UTF-8, by breaking the rule
that the number of bytes in a multi-byte sequence (i.e. more than 1
byte) can be determined from the number of leading 1-bits in the lead
byte.
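
A minimal Python sketch of the leading-1-bits rule, and of how the proposed 111111xx lead byte would break it. The function names are hypothetical, and the 5- and 6-byte lengths refer to the original UTF-8 definition, not the current restricted one.

```python
def leading_ones(byte):
    """Count the consecutive 1-bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count

def sequence_length(lead):
    """Length implied by a lead byte, or None for a trailing byte (10xxxxxx)."""
    n = leading_ones(lead)
    if n == 0:
        return 1      # 0xxxxxxx: single-byte (ASCII)
    if n == 1:
        return None   # 10xxxxxx: trailing byte, not a lead byte
    return n          # 110xxxxx -> 2, 1110xxxx -> 3, and so on

# Under the proposal, 0xFC through 0xFF would all start a 6-byte sequence,
# but their leading 1-bit counts no longer agree with one another.
for lead in (0xFC, 0xFD, 0xFE, 0xFF):
    print(hex(lead), leading_ones(lead))
```

With the existing rule, a decoder needs only the count of leading 1-bits. With 0xFE and 0xFF allowed as lead bytes, counts of 6, 7 and 8 would all have to map to the same length, so the decoder needs a special case.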
The leading byte in a UTF-8 sequence does not cover all 256 combinations
in any event. The values 0x80 through 0xBF are trailing bytes only, and
0xC0 and 0xC1 will also never occur as the lead byte in a properly
formed UTF-8 sequence.
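
To see why 0xC0 and 0xC1 are excluded, here is a short Python sketch (helper names are mine). A 2-byte sequence has the form 110yyyyy 10xxxxxx, and with lead bytes 0xC0 or 0xC1 the result can only be a value below 0x80, which must be encoded as a single byte instead.

```python
def is_trailing(byte):
    """True for 10xxxxxx, i.e. 0x80 through 0xBF."""
    return (byte & 0xC0) == 0x80

def two_byte_range(lead):
    """Lowest and highest value a 2-byte sequence with this lead byte can carry."""
    y = lead & 0x1F                    # the five payload bits of 110yyyyy
    return (y << 6, (y << 6) | 0x3F)   # trailing byte contributes six more bits

for lead in (0xC0, 0xC1, 0xC2):
    low, high = two_byte_range(lead)
    print(hex(lead), hex(low), hex(high))

# Python's strict UTF-8 decoder refuses such an overlong form.
try:
    bytes([0xC0, 0x80]).decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```

The first lead byte whose range starts at 0x80 or above is 0xC2, which is why it is the lowest lead byte seen in well-formed 2-byte sequences.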
This is all moot anyway, since Unicode and ISO 10646 have restricted the
range of code points to [0x00, 0x10FFFF], with the result that no 5-byte
or 6-byte sequences represent valid code points.
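
A quick check in Python, offered only as an illustration: the highest permitted code point encodes in exactly 4 bytes, and the language refuses to create a character beyond it. The variable names are mine.

```python
MAX_CODE_POINT = 0x10FFFF

# Encode the highest permitted code point and look at the byte sequence.
encoded = chr(MAX_CODE_POINT).encode("utf-8")
print(encoded.hex(), len(encoded))

# One past the limit is not a valid code point, so chr() raises ValueError.
try:
    chr(MAX_CODE_POINT + 1)
    beyond_allowed = True
except ValueError:
    beyond_allowed = False
print(beyond_allowed)
```

The encoded form is F4 8F BF BF, so 0xF4 is the highest lead byte that can appear in well-formed UTF-8 under the current range.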