Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 17 2005 - 13:12:02 CST

Next message: Peter Constable: "RE: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law)"

Previous message: Doug Ewell: "Re: 32'nd bit & UTF-8"
Maybe in reply to: Hans Aberg: "32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"

Hans Aberg asked:

> Are there any good reasons for UTF-32 to exclude the 32nd bit of an encoded
> 4-byte unit?

Yes. In fact there are good reasons for it to exclude the 22nd
through 31st bits, as well.

UTF-32 is only *defined* on the range U+0000..U+10FFFF.
(Actually, U+0000..U+D7FF, U+E000..U+10FFFF.)
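The range above comes down to a little arithmetic. What follows is only a minimal sketch in Python; the names `is_scalar_value`, `MAX_CODE_POINT`, `SURROGATES`, and `BITS_NEEDED` are illustrative choices for this example, not part of any standard API.

```python
MAX_CODE_POINT = 0x10FFFF            # highest code point in Unicode/10646
SURROGATES = range(0xD800, 0xE000)   # U+D800..U+DFFF, excluded from UTF-32


def is_scalar_value(value: int) -> bool:
    """Return True if value lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0 <= value <= MAX_CODE_POINT and value not in SURROGATES


# U+10FFFF fits in 21 bits, so bits 22 through 32 of a valid
# UTF-32 code unit are always zero.
BITS_NEEDED = MAX_CODE_POINT.bit_length()
```

Under these assumptions, `is_scalar_value(0x80000000)` is false for the same reason `is_scalar_value(0x110000)` is: both lie outside the defined range, whether or not the 32nd bit is set.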

> I.e., the 6-byte combinations
> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> where the first x = 1.
>
> With a full 32-bit encoding, one can also use UTF-8 to encode binary data.

No, one cannot. Using UTF-8 to encode binary data is horribly
non-conformant. UTF-8 is not a representation for binary data,
it is a character encoding form for *encoded characters* in
Unicode/10646, defined on the same range of code points as
UTF-32.
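As a rough illustration of that shared range, the sketch below uses Python's built-in UTF-8 codec, which rejects anything outside the current definition. The helper name `try_decode` and the sample byte strings are made up for this example.

```python
def try_decode(data: bytes):
    """Decode data as UTF-8, returning None if the bytes are ill-formed."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The highest code point, U+10FFFF, needs only four bytes in UTF-8.
highest = chr(0x10FFFF).encode("utf-8")

# A six-byte sequence with a 0xFC lead byte is not well-formed UTF-8 ...
six_byte = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])

# ... and neither is arbitrary binary data such as a stray 0xFF byte.
binary = bytes([0xFF, 0x00, 0x80])

results = [try_decode(highest), try_decode(six_byte), try_decode(binary)]
```

Only the first entry of `results` decodes; the other two come back as `None`, which is why a conformant UTF-8 decoder cannot serve as a general container for binary data.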

> It also simplifies somewhat the implementation of Unicode in lexer
> generators (such as Flex): The leading byte then covers all 256
> combinations. All 2^32 numbers should probably be there for generating
> proper lexer error messages.

There is nothing preventing your lexer implementation from simply
putting UTF-32 data into 32-bit registers as 32-bit unsigned
integers. Everybody does that. And all 32-bit integral values are
there for generating whatever lexer error messages you want.
But the valid *character* values are only U+0000..U+D7FF, U+E000..
U+10FFFF. If your lexer runs into some other numerical value,
then it is dealing with bogus data that cannot be interpreted
as a Unicode character.
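One way a lexer might follow this approach is sketched below in Python: every code unit is read as an unsigned 32-bit integer, and only afterwards classified as a character or as bogus data. The function name `scan_utf32le`, the little-endian byte order, and the wording of the error messages are all assumptions made for the example.

```python
import struct


def scan_utf32le(data: bytes):
    """Yield ('char', str) or ('error', message) for each 32-bit unit."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 input must be a multiple of 4 bytes")
    # "<I" reads each unit as a little-endian unsigned 32-bit integer,
    # so all 2**32 values are available for error reporting.
    for (unit,) in struct.iter_unpack("<I", data):
        if unit > 0x10FFFF:
            yield ("error", f"value 0x{unit:08X} is beyond U+10FFFF")
        elif 0xD800 <= unit <= 0xDFFF:
            yield ("error", f"value 0x{unit:08X} is a surrogate code point")
        else:
            yield ("char", chr(unit))


# Usage: two valid characters and two bogus values.
sample = struct.pack("<4I", 0x41, 0xD800, 0x10FFFF, 0x80000000)
tokens = list(scan_utf32le(sample))
```

Here `tokens` holds a character for 0x41 and 0x10FFFF and an error entry for each of the other two values, including the one with the 32nd bit set. The integer type carries every value; the validity check is a separate step.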

--Ken Whistler

>
> Hans Aberg
>
>
>

This archive was generated by hypermail 2.1.5 : Mon Jan 17 2005 - 13:15:53 CST