From: Hans Aberg (haberg@math.su.se)
Date: Mon Jan 17 2005 - 18:56:59 CST
[Warning: For some reason your email scored high spam levels with me, which
caused it to be sorted as spam.]
At 10:46 -0800 2005/01/17, Doug Ewell wrote:
>Hans Aberg <haberg at math dot su dot se> wrote:
>
>> Are there any good reasons for UTF-32 to exclude the 32nd bit of an
>> encoded 4-byte value? I.e., the 6-byte combinations
>> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>> where the first x = 1.
>
>(I think you mean UTF-8 rather than UTF-32.)
Right. Sorry for the typo.
>This is not a limit of UTF-8 per se, but of the underlying ISO/IEC 10646
>encoding. It was intentionally designed as a 31-bit value (bit 31 = 0)
>to prevent implementation problems in "signed integer" environments.
>Java, for example, has no unsigned 32-bit integer type.
OK. But the main thing is that there are 32 bits in the type. Then it
actually does not matter whether it is signed or unsigned. Just translate
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
into
111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
regardless of whether the underlying type is signed or unsigned.
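For illustration, here is a minimal C sketch of that translation. The
function name encode32 is mine, and the 6-byte form is the hypothetical
extension under discussion, not standard UTF-8; for simplicity it always
emits 6 bytes rather than the shortest form:

    #include <stdint.h>

    /* Hypothetical full-32-bit form: 111111xx followed by five
       10xxxxxx bytes.  Casting to uint32_t first makes the shifts
       well-defined no matter whether the caller's own 32-bit type
       is signed or unsigned. */
    void encode32(uint32_t x, unsigned char out[6])
    {
        out[0] = 0xFC | (unsigned char)(x >> 30);           /* top 2 bits  */
        out[1] = 0x80 | (unsigned char)((x >> 24) & 0x3F);  /* next 6 bits */
        out[2] = 0x80 | (unsigned char)((x >> 18) & 0x3F);
        out[3] = 0x80 | (unsigned char)((x >> 12) & 0x3F);
        out[4] = 0x80 | (unsigned char)((x >>  6) & 0x3F);
        out[5] = 0x80 | (unsigned char)( x        & 0x3F);  /* low 6 bits  */
    }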
>> With a full 32-bit encoding, one can also use UTF-8 to encoding binary
>> data. It also simplifies somewhat the implementation of Unicode in
>> lexer generators (such as Flex): The leading byte then covers all 256
>> combinations. All 2^32 numbers should probably be there for generating
>> proper lexer error messages.
>
>This is not what UTF-8 is for. It is a multi-byte encoding scheme
>intended to cover the entire ISO 10646 and Unicode space while remaining
>ASCII-compatible. Formats intended for arbitrary binary data have
>neither the same requirements nor the same constraints. In fact, for
>randomly occurring binary data, UTF-8 will use 5 or 6 bytes to represent
>a 32-bit value 99.9% of the time, with none of the obvious benefits
>(like control-code transparency) for which special encoding formats for
>binary data are usually employed.
Things have a tendency to be used for what they were not intended for. One
advantage of the UTF-8 encoding (or rather the extended version) is that it
is insensitive to the big/little-endian issue. Of course, special binary
protocols are better. But one way to make sure endianness comes out right is
to pass the data through UTF-8, as the decoding sketch below illustrates.
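A hypothetical counterpart to the encode32 sketch above shows why: it
consumes one byte at a time in stream order, so the result is the same on
big- and little-endian machines, with no byte swapping anywhere:

    /* Hypothetical decoder matching encode32 above.  Bytes are read
       in stream order, so host byte order never enters the picture. */
    uint32_t decode32(const unsigned char in[6])
    {
        uint32_t x = in[0] & 0x03;          /* low 2 bits of lead byte */
        for (int i = 1; i < 6; i++)
            x = (x << 6) | (in[i] & 0x3F);  /* 6 payload bits per byte */
        return x;
    }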
>Allowing the lead byte in a UTF-8 sequence to extend to 0xFF may
>actually complicate the implementation of UTF-8, by breaking the rule
>that the number of bytes in a multi-byte sequence (i.e. more than 1
>byte) can be determined from the number of leading 1-bits in the lead
>byte.
Instead one just checks for up to six leading 1-bits. It cannot be hard to
handle that, even when writing code directly by hand; see the sketch below.
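A sketch of such a hand-written check, assuming the hypothetical 111111xx
extension (so that 0xFE and 0xFF also introduce 6-byte sequences):

    /* Length of a sequence, from the leading 1-bits of the lead byte.
       Returns 0 for a stray trailing byte (10xxxxxx). */
    int seq_len(unsigned char lead)
    {
        if (lead < 0x80) return 1;   /* 0xxxxxxx: single byte (ASCII)  */
        if (lead < 0xC0) return 0;   /* 10xxxxxx: trailing byte only   */
        if (lead < 0xE0) return 2;   /* 110xxxxx                       */
        if (lead < 0xF0) return 3;   /* 1110xxxx                       */
        if (lead < 0xF8) return 4;   /* 11110xxx                       */
        if (lead < 0xFC) return 5;   /* 111110xx                       */
        return 6;                    /* 111111xx, incl. 0xFE and 0xFF  */
    }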
>The leading byte in a UTF-8 sequence does not cover all 256 combinations
>in any event. The values 0x80 through 0xBF are trailing bytes only, and
>0xC0 and 0xC1 will also never occur as the lead byte in a properly
>formed sequence.
>
>This is all moot anyway, since Unicode and ISO 10646 have restricted the
>range of code points to [0x00, 0x10FFFF], with the result that no 5-byte
>or 6-byte sequences represent valid code points.
I gave more motivations in my letter to Kenneth Whistler. In a lexer
generator, such as Flex, it seems convenient to cover those values, as it is
not generally known what should happen at them. So those values will have to
be covered in the implementation anyhow. Then it might be good if Unicode
thought those questions through and gave some support for it; a sketch of
what I mean follows below.
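For example, a hand-written scanner could classify all 256 lead-byte values
so that invalid bytes get a specific error message. This is a sketch only:
seq_len is from the earlier sketch, and next_byte() and error() are
placeholders of my own, not real Flex APIs:

    extern int  next_byte(void);               /* placeholder input  */
    extern void error(const char *fmt, ...);   /* placeholder report */

    /* One scanner step: every possible lead byte lands in some case,
       so errors can be reported precisely rather than falling through
       as undefined behavior. */
    void scan_one(void)
    {
        int c = next_byte();
        switch (seq_len((unsigned char)c)) {
        case 0:
            error("stray trailing byte 0x%02X", (unsigned)c);
            break;
        case 1:
            /* ordinary single-byte (ASCII) character */
            break;
        default:
            /* read the remaining trailing bytes, each 10xxxxxx */
            break;
        }
    }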
Hans Aberg