Re: 32'nd bit & UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Jan 17 2005 - 12:46:52 CST


    Hans Aberg <haberg at math dot su dot se> wrote:

    > Are there any good reasons for UTF-32 to exclude the 32'nd bit of an
    > encoded 4-byte? I.e, the 6-byte combinations
    > 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > where the first x = 1.

    (I think you mean UTF-8 rather than UTF-32.)

    This is not a limit of UTF-8 per se, but of the underlying ISO/IEC 10646
    code space. That code space was intentionally designed to be 31 bits
    wide (bit 31 = 0) to prevent implementation problems in "signed integer"
    environments. Java, for example, has no unsigned 32-bit integer type.
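
    To illustrate the signed-integer concern, here is a minimal Python
    sketch. The helper name as_signed_32 is made up for this example; it
    simply reinterprets a 32-bit pattern as a two's-complement value, the
    way a signed 32-bit type would see it.

```python
def as_signed_32(value):
    """Reinterpret an unsigned 32-bit pattern as a signed 32-bit integer."""
    return int.from_bytes(value.to_bytes(4, "big"), "big", signed=True)


# With bit 31 clear the value is unchanged; with bit 31 set it turns negative.
largest_31_bit = as_signed_32(0x7FFFFFFF)
smallest_with_bit_31 = as_signed_32(0x80000000)
all_ones = as_signed_32(0xFFFFFFFF)
```

    Any value with bit 31 set comes out negative, so comparisons and range
    checks written for code points would need special care.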

    > With a full 32-bit encoding, one can also use UTF-8 to encode binary
    > data. It also simplifies somewhat the implementation of Unicode in
    > lexer generators (such as Flex): The leading byte then covers all 256
    > combinations. All 2^32 numbers should probably be there for generating
    > proper lexer error messages.

    This is not what UTF-8 is for. It is a multi-byte encoding scheme
    intended to cover the entire ISO/IEC 10646 and Unicode space while
    remaining ASCII-compatible. Formats intended for arbitrary binary data
    have neither the same requirements nor the same constraints. In fact,
    for randomly occurring binary data, UTF-8 will use 5 or 6 bytes to
    represent a 32-bit value more than 99.9% of the time (only values below
    0x200000 fit in 4 bytes or fewer), with none of the obvious benefits
    (like control-code transparency) for which special encoding formats for
    binary data are usually employed.
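
    The arithmetic behind that percentage can be sketched as follows. The
    table uses the range boundaries of the original 1-to-6-byte UTF-8
    definition; stretching the 6-byte range from 2^31 to 2^32 is my
    assumption, made only to model the proposed extension. The names
    LIMITS and share_by_length are hypothetical.

```python
from fractions import Fraction

# (sequence length, exclusive upper bound of the values it can encode).
# The final bound is 2**32 rather than 2**31: an assumption that models
# the proposed 32-bit extension, not anything in a published standard.
LIMITS = [
    (1, 0x80),
    (2, 0x800),
    (3, 0x10000),
    (4, 0x200000),
    (5, 0x4000000),
    (6, 0x100000000),
]


def share_by_length():
    """Return the fraction of all 32-bit values that need each length."""
    shares = {}
    lower = 0
    for length, upper in LIMITS:
        shares[length] = Fraction(upper - lower, 2**32)
        lower = upper
    return shares


shares = share_by_length()
long_share = shares[5] + shares[6]  # fraction needing 5 or 6 bytes
```

    Under these assumptions long_share is 2047/2048, so a uniformly random
    32-bit value would nearly always be stored in more bytes than the 4 it
    occupies in raw form.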

    Allowing the lead byte in a UTF-8 sequence to extend to 0xFF may
    actually complicate the implementation of UTF-8, by breaking the rule
    that the number of bytes in a multi-byte sequence (i.e. more than 1
    byte) can be determined from the number of leading 1-bits in the lead
    byte. The lead bytes 0xFE and 0xFF have seven and eight leading 1-bits
    respectively, yet both would have to introduce 6-byte sequences.
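
    A minimal sketch of the leading-1-bits rule, assuming the original
    1-to-6-byte scheme. The function names are made up for this example,
    and the decision to reject 0xFE and 0xFF reflects the existing rule,
    not the proposed extension.

```python
def leading_ones(byte):
    """Count the consecutive 1-bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def sequence_length_from_lead(byte):
    """Return the sequence length implied by a first byte."""
    ones = leading_ones(byte)
    if ones == 0:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character
    if ones == 1:
        raise ValueError("10xxxxxx is a trailing byte, not a lead byte")
    if ones <= 6:
        return ones  # the count of leading 1-bits is the length
    # 0xFE and 0xFF: seven or eight leading 1-bits, so the rule has no
    # answer for them without a special case.
    raise ValueError("no sequence length defined for this byte")
```

    For example, sequence_length_from_lead(0xE2) gives 3 and
    sequence_length_from_lead(0xFC) gives 6, while 0xFE and 0xFF fall
    through to the error. Mapping both of them to 6 would require exactly
    the kind of special case the rule currently avoids.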

    The leading byte in a UTF-8 sequence does not cover all 256 combinations
    in any event. The values 0x80 through 0xBF are trailing bytes only, and
    0xC0 and 0xC1 will also never occur as the lead byte in a properly
    formed sequence.
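
    As a rough illustration, the following sketch classifies every byte
    value by the role it can play at the start of a sequence. It assumes
    the original 1-to-6-byte definition; under the current restriction to
    0x10FFFF, the bytes 0xF5 and above are excluded as well. The function
    name and the role labels are hypothetical.

```python
def first_byte_role(byte):
    """Classify a byte value under the original 1-to-6-byte UTF-8 scheme."""
    if byte <= 0x7F:
        return "single"       # ASCII, a complete sequence by itself
    if byte <= 0xBF:
        return "trailing"     # 10xxxxxx, never first in a sequence
    if byte <= 0xC1:
        return "never valid"  # could only start an overlong 2-byte form
    if byte <= 0xFD:
        return "lead"         # starts a 2- to 6-byte sequence
    return "never valid"      # 0xFE and 0xFF


# Count the byte values that can legitimately begin a sequence.
usable_first_bytes = sum(
    1 for b in range(256) if first_byte_role(b) in ("single", "lead")
)
```

    Under these assumptions only 188 of the 256 byte values can begin a
    well-formed sequence, so a lexer cannot treat all 256 alike in the
    first position either way.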

    This is all moot anyway, since Unicode and ISO 10646 have restricted the
    range of code points to [0x00, 0x10FFFF], with the result that no 5-byte
    or 6-byte sequences represent valid code points.
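
    The consequence of the restricted range can be sketched in a few
    lines. The function name utf8_length is made up for this example, and
    the sketch does not treat the surrogate range 0xD800 through 0xDFFF
    specially, which a real encoder would have to reject.

```python
MAX_CODE_POINT = 0x10FFFF


def utf8_length(code_point):
    """Return the UTF-8 sequence length for a code point in the Unicode range.

    Surrogates (0xD800-0xDFFF) are not checked here; this sketch only
    demonstrates the upper limit.
    """
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4  # everything up to 0x10FFFF fits in 4 bytes


# Cross-check the top of the range against Python's built-in encoder.
longest = len(chr(MAX_CODE_POINT).encode("utf-8"))
```

    The largest code point encodes to 4 bytes, and anything beyond it is
    rejected before a 5-byte or 6-byte form could ever be needed.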

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


