From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 17 2005 - 13:12:02 CST
Hans Aberg asked:
> Are there any good reasons for UTF-32 to exclude the 32nd bit of an encoded
> 4-byte value?
Yes. In fact there are good reasons for it to exclude the 22nd
through 31st bits, as well.
UTF-32 is only *defined* on the range U+0000..U+10FFFF.
(Actually, U+0000..U+D7FF, U+E000..U+10FFFF.)
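As a minimal sketch of that range check (the names below are illustrative and not taken from any standard API), the test for a valid UTF-32 code unit comes down to two comparisons, and the upper bound U+10FFFF needs only 21 bits:

```python
# Hypothetical helper names; the numeric bounds are the ones stated above.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_valid_utf32_unit(value: int) -> bool:
    """Return True if value lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    if value < 0 or value > MAX_CODE_POINT:
        return False
    # The surrogate range is excluded from the valid values.
    return not (SURROGATE_FIRST <= value <= SURROGATE_LAST)


# Number of bits needed to hold the largest valid value.
BITS_NEEDED = MAX_CODE_POINT.bit_length()
```

Any 32-bit value with one of the upper 11 bits set is therefore outside the defined range.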
> I.e, the 6-byte combinations
> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> where the first x = 1.
>
> With a full 32-bit encoding, one can also use UTF-8 to encode binary data.
No, one cannot. Using UTF-8 to encode binary data is horribly
non-conformant. UTF-8 is not a representation for binary data,
it is a character encoding form for *encoded characters* in
Unicode/10646, defined on the same range of code points as
UTF-32.
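To illustrate, here is a small sketch that leans on Python's built-in strict "utf-8" codec as the conformance check. The helper name is made up for this example. A 6-byte sequence of the kind quoted above, an encoded surrogate, and an encoding of a value past U+10FFFF should all be rejected:

```python
def is_conformant_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Sample inputs (chosen for illustration only).
ORDINARY_TEXT = "caf\u00e9".encode("utf-8")
SIX_BYTE_FORM = bytes([0xFF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])  # 111111xx lead byte
ENCODED_SURROGATE = bytes([0xED, 0xA0, 0x80])                # would be U+D800
PAST_10FFFF = bytes([0xF4, 0x90, 0x80, 0x80])                # would be 0x110000
```

Arbitrary binary data will generally contain sequences like the last three, which is why it cannot be passed off as UTF-8.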
> It also simplifies somewhat the implementation of Unicode in lexer
> generators (such as Flex): The leading byte then covers all 256
> combinations. All 2^32 numbers should probably be there for generating
> proper lexer error messages.
There is nothing preventing your lexer implementation from simply
putting UTF-32 data into 32-bit registers as 32-bit unsigned
integers. Everybody does that. And all 32-bit integral values are
there for generating whatever lexer error messages you want.
But the valid *character* values are only U+0000..U+D7FF, U+E000..
U+10FFFF. If your lexer runs into some other numerical value,
then it is dealing with bogus data that cannot be interpreted
as a Unicode character.
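A rough sketch of that approach follows. It assumes little-endian input with no byte order mark, and the function names and message wording are invented for the example. It reads each 4-byte unit as an unsigned 32-bit integer and reports anything that is not a valid character value:

```python
import struct


def classify_unit(value: int) -> str:
    """Describe one unsigned 32-bit value read from a UTF-32 stream."""
    if value > 0x10FFFF:
        return f"error: 0x{value:08X} is beyond U+10FFFF"
    if 0xD800 <= value <= 0xDFFF:
        return f"error: 0x{value:08X} is a surrogate code point"
    return f"U+{value:04X}"


def scan_utf32le(data: bytes) -> list[str]:
    """Classify every 4-byte little-endian unit in data.

    struct.iter_unpack raises struct.error if len(data) is not a
    multiple of 4, so truncated input is reported by the caller.
    """
    return [classify_unit(v) for (v,) in struct.iter_unpack("<I", data)]


# Example input: 'A', a lone surrogate, the last code point, all bits set.
SAMPLE = struct.pack("<4I", 0x41, 0xD800, 0x10FFFF, 0xFFFFFFFF)
REPORT = scan_utf32le(SAMPLE)
```

All 2^32 values are representable here and can be named in an error message; only the ones in the valid range are treated as characters.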
--Ken Whistler
>
> Hans Aberg