Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 17 2005 - 13:12:02 CST


    Hans Aberg asked:

    > Are there any good reasons for UTF-32 to exclude the 32'nd bit of an encoded
    > 4-byte?

    Yes. In fact there are good reasons for it to exclude the 22nd
    through 31st bits, as well.

    UTF-32 is only *defined* on the range U+0000..U+10FFFF.
    (Actually, U+0000..U+D7FF, U+E000..U+10FFFF.)
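
    As a rough illustration of the range above, here is a minimal Python
    sketch. The names MAX_CODE_POINT, is_scalar_value and BITS_NEEDED are
    made up for this example and come from no standard or library. It
    checks whether an integer falls in U+0000..U+D7FF or U+E000..U+10FFFF,
    and counts how many bits the largest code point actually needs.

```python
# Hypothetical names, chosen for illustration only.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_scalar_value(n: int) -> bool:
    """True if n lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    if n < 0 or n > MAX_CODE_POINT:
        return False
    # The surrogate block is carved out of the valid range.
    return not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


# Number of bits needed to represent the largest code point.
BITS_NEEDED = MAX_CODE_POINT.bit_length()
```

    BITS_NEEDED comes out as 21, which is why the 22nd and higher bits of
    a UTF-32 code unit are never set for a valid character.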

    > I.e, the 6-byte combinations
    > 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > where the first x = 1.
    >
    > With a full 32-bit encoding, one can also use UTF-8 to encode binary data.

    No, one cannot. Using UTF-8 to encode binary data is horribly
    non-conformant. UTF-8 is not a representation for binary data,
    it is a character encoding form for *encoded characters* in
    Unicode/10646, defined on the same range of code points as
    UTF-32.
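
    One way to see this in practice is to hand a strict UTF-8 decoder
    some byte sequences that fall outside that range. The sketch below
    uses Python's built-in "utf-8" codec; the helper name decodes_as_utf8
    and the sample byte strings are assumptions made for this example.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sample inputs (illustrative only).
LAST_CODE_POINT = b"\xf4\x8f\xbf\xbf"      # U+10FFFF, the top of the range
PAST_THE_END = b"\xf4\x90\x80\x80"         # bit pattern for 0x110000
SURROGATE = b"\xed\xa0\x80"                # bit pattern for U+D800
SIX_BYTE_FORM = b"\xfc\x84\x80\x80\x80\x80"  # old-style 6-byte lead byte
RAW_BINARY = b"\xff\xfe\x00\x01"           # arbitrary non-text bytes
```

    Of these, only LAST_CODE_POINT should be accepted. The others are
    rejected by the decoder, which is the behaviour one would expect from
    an encoding form defined only on valid code points.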

    > It also simplifies somewhat the implementation of Unicode in lexer
    > generators (such as Flex): The leading byte then covers all 256
    > combinations. All 2^32 numbers should probably be there for generating
    > proper lexer error messages.

    There is nothing preventing your lexer implementation from simply
    putting UTF-32 data into 32-bit registers as 32-bit unsigned
    integers. Everybody does that. And all 32-bit integral values are
    there for generating whatever lexer error messages you want.
    But the valid *character* values are only U+0000..U+D7FF, U+E000..
    U+10FFFF. If your lexer runs into some other numerical value,
    then it is dealing with bogus data that cannot be interpreted
    as a Unicode character.
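
    A minimal sketch of that approach, in Python rather than in a real
    lexer generator: read each UTF-32 code unit as a 32-bit unsigned
    integer, then decide separately whether it is a valid character. The
    function name classify_utf32le, the little-endian byte order, and the
    wording of the messages are all assumptions made for this example.

```python
import struct


def classify_utf32le(data: bytes):
    """Yield (value, message) for each little-endian 32-bit unit in data."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    # "<I" reads one little-endian unsigned 32-bit integer per unit, so
    # every one of the 2^32 possible values is representable here.
    for (value,) in struct.iter_unpack("<I", data):
        if value > 0x10FFFF:
            yield value, f"error: 0x{value:08X} is beyond U+10FFFF"
        elif 0xD800 <= value <= 0xDFFF:
            yield value, f"error: 0x{value:08X} is a surrogate code point"
        else:
            yield value, f"ok: U+{value:04X}"
```

    For instance, feeding it the four units 0x41, 0xD800, 0x10FFFF and
    0x80000000 should report the first and third as characters and the
    other two as errors, without the encoding itself needing to define
    anything past U+10FFFF.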

    --Ken Whistler

    >
    > Hans Aberg
    >
    >
    >


