Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Mon Jan 17 2005 - 18:19:25 CST

  • Next message: Hans Aberg: "RE: 32'nd bit & UTF-8"

    At 11:12 -0800 2005/01/17, Kenneth Whistler wrote:
    >> Are there any good reasons for UTF-32 to exclude the 32nd bit of an encoded
    >> 4-byte sequence?
    >Yes. In fact there are good reasons for it to exclude the 22nd
    >through 31st bits, as well.
    >UTF-32 is only *defined* on the range U+0000..U+10FFFF.
    >(Actually, U+0000..U+D7FF, U+E000..U+10FFFF.)

    Sorry, I have been somewhat sloppy in my terminology. I am well aware
    that, strictly speaking, UTF-8 and UTF-32 are defined only for those
    21-bit values. But in <> there is an extension handling 32-bit numbers.
    It seems strange that this extension excludes the full 32 bits.
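
    For reference, the extension in question is the original six-byte
    scheme (as in RFC 2279 and the early ISO 10646 drafts): a lead byte of
    the form 1111110x plus five continuation bytes gives 1 + 5*6 = 31
    payload bits, so the 32nd bit indeed cannot be represented. A minimal
    Haskell sketch of that scheme (mine, not code from the cited draft;
    Haskell since that is the language used below):

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Word (Word8, Word32)

-- Encode a 32-bit value in the extended (pre-2003) UTF-8 scheme.
-- The lead byte 1111110x leaves a single payload bit, so with five
-- continuation bytes (6 bits each) the maximum is 31 bits: any value
-- with the 32nd bit set has no encoding.
encodeExt :: Word32 -> Maybe [Word8]
encodeExt n
  | n < 0x80       = Just [fromIntegral n]
  | n < 0x800      = Just (mark 0xC0 1)
  | n < 0x10000    = Just (mark 0xE0 2)
  | n < 0x200000   = Just (mark 0xF0 3)
  | n < 0x4000000  = Just (mark 0xF8 4)
  | n < 0x80000000 = Just (mark 0xFC 5)
  | otherwise      = Nothing   -- 32nd bit set: not representable
  where
    -- lead byte carrying the top bits, then k continuation bytes
    mark lead k =
      fromIntegral (lead .|. (n `shiftR` (6 * k))) :
      [ 0x80 .|. fromIntegral ((n `shiftR` (6 * i)) .&. 0x3F)
      | i <- [k - 1, k - 2 .. 0] ]
```

    Note that for values up to U+10FFFF this agrees with standard UTF-8;
    only the five- and six-byte rows go beyond it.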

    >> I.e, the 6-byte combinations
    >> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> where the first x = 1.
    >> With a full 32-bit encoding, one can also use UTF-8 to encode binary data.
    >No, one cannot. Using UTF-8 to encode binary data is horribly
    >non-conformant. UTF-8 is not a representation for binary data,
    >it is a character encoding form for *encoded characters* in
    >Unicode/10646, defined on the same range of code points as

    I will describe my motivations in more detail below.

    >> It also simplifies somewhat the implementation of Unicode in lexer
    >> generators (such as Flex): The leading byte then covers all 256
    >> combinations. All 2^32 numbers should probably be there for generating
    >> proper lexer error messages.
    >There is nothing preventing your lexer implementation from simply
    >putting UTF-32 data into 32-bit registers as 32-bit unsigned
    >integers. Everybody does that. And all 32-bit integral values are
    >there for generating whatever lexer error messages you want.
    >But the valid *character* values are only U+0000..U+D7FF, U+E000..
    >U+10FFFF. If your lexer runs into some other numerical value,
    >then it is dealing with bogus data that cannot be interpreted
    >as a Unicode character.

    I am not really speaking about implementing a lexer directly, but about
    implementing lexers by means of a lexer (scanner) generator such as
    Flex. It is not known, at least to the Flex group, how to extend Flex
    to Unicode. Using UTF-32 and 32-bit input directly would require that
    one implement tables over the full 32-bit range.

    So last week I thought that one could instead translate UTF-8 and
    UTF-32BE/LE regular expressions into 1-byte regular expressions, and
    then let Flex expand those. I wrote some functions in Haskell using
    Hugs doing just this. For details, see
      List-Info: <>
      List-Archive: <>
    Spin-offs of this method are that different encodings can be mixed in a
    single lexer, and that the big/little-endian issue is resolved.
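
    The first step of such a translation can be sketched as follows (a
    hedged illustration of the general idea, not the actual Hugs code
    referred to above): a code-point range is split at the UTF-8
    encoding-length boundaries, so that each resulting subrange encodes to
    byte sequences of one fixed length and can then be expanded byte-wise
    into 1-byte regular expressions.

```haskell
-- Split an inclusive code-point range at the UTF-8 encoding-length
-- boundaries. Every code point in a resulting subrange encodes to the
-- same number of bytes, which is the precondition for expanding the
-- subrange byte-wise into 1-byte regular expressions for Flex.
-- (Boundaries shown for standard 21-bit UTF-8; the five- and six-byte
-- rows of the extended scheme would be added analogously.)
lengthSplit :: Int -> Int -> [(Int, Int)]
lengthSplit lo hi =
  [ (max lo a, min hi b)
  | (a, b) <- [ (0x0,     0x7F)      -- 1-byte sequences
              , (0x80,    0x7FF)     -- 2-byte sequences
              , (0x800,   0xFFFF)    -- 3-byte sequences
              , (0x10000, 0x10FFFF)  -- 4-byte sequences
              ]
  , max lo a <= min hi b ]
```

    Within each fixed-length piece, one then recurses over byte positions,
    splitting further wherever the endpoints' byte sequences diverge.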

    Then, when using Flex to write a Unicode lexer, one wants to cover all
    combinations, including the illegal ones. This includes not only the
    full 32 bits, but also the overlong and otherwise invalid sequences.
    When writing a Unicode scanner, one will of course just let the lexer
    generate some error message. Some other scanner may want to attempt to
    resynchronize. So from this general standpoint it is more convenient
    that Flex does not make any decision about what to do with those
    values.
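
    Resynchronization of the kind mentioned above is commonly done by
    skipping forward to the next byte that is not a continuation byte; a
    small sketch (the function names are mine):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- A UTF-8 continuation byte has the form 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- After consuming an invalid lead byte or a truncated sequence, drop
-- bytes until the next plausible sequence start, so lexing can resume.
resync :: [Word8] -> [Word8]
resync = dropWhile isContinuation
```

    A lexer that instead just wants an error message would report the
    offending bytes rather than call anything like this.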

    Of course, Flex can simply do what it wants in those cases. But it is
    good to have some general conventions for those values as well.

      Hans Aberg

    This archive was generated by hypermail 2.1.5 : Mon Jan 17 2005 - 18:54:21 CST