Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Mon Jan 17 2005 - 18:19:25 CST

  • Next message: Hans Aberg: "RE: 32'nd bit & UTF-8"

    At 11:12 -0800 2005/01/17, Kenneth Whistler wrote:
    >> Are there any good reasons for UTF-32 to exclude the 32nd bit of an encoded
    >> 4-byte sequence?
    >Yes. In fact there are good reasons for it to exclude the 22nd
    >through 31st bits, as well.
    >UTF-32 is only *defined* on the range U+0000..U+10FFFF.
    >(Actually, U+0000..U+D7FF, U+E000..U+10FFFF.)

    Sorry, I have been somewhat sloppy in my terminology. I am well aware
    that, strictly speaking, UTF-8 and UTF-32 are defined only for those
    21-bit values. But in <> there is an extension handling 32-bit numbers.
    It seems strange that this extension excludes the full 32 bits.
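
    For reference, the extension in question is the original six-byte
    scheme (as in RFC 2279 and the early ISO 10646 drafts): a lead byte of
    the form 1111110x plus five continuation bytes gives 1 + 5*6 = 31
    payload bits, so the 32nd bit indeed cannot be represented. A minimal
    Haskell sketch of that scheme (mine, not code from the cited draft;
    Haskell since that is the language used below):

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Word (Word8, Word32)

-- Encode a 32-bit value in the extended (pre-2003) UTF-8 scheme.
-- The lead byte 1111110x leaves a single payload bit, so with five
-- continuation bytes (6 bits each) the maximum is 31 bits: any value
-- with the 32nd bit set has no encoding.
encodeExt :: Word32 -> Maybe [Word8]
encodeExt n
  | n < 0x80       = Just [fromIntegral n]
  | n < 0x800      = Just (mark 0xC0 1)
  | n < 0x10000    = Just (mark 0xE0 2)
  | n < 0x200000   = Just (mark 0xF0 3)
  | n < 0x4000000  = Just (mark 0xF8 4)
  | n < 0x80000000 = Just (mark 0xFC 5)
  | otherwise      = Nothing   -- 32nd bit set: not representable
  where
    -- lead byte carrying the top bits, then k continuation bytes
    mark lead k =
      fromIntegral (lead .|. (n `shiftR` (6 * k))) :
      [ 0x80 .|. fromIntegral ((n `shiftR` (6 * i)) .&. 0x3F)
      | i <- [k - 1, k - 2 .. 0] ]
```

    Note that for values up to U+10FFFF this agrees with standard UTF-8;
    only the five- and six-byte rows go beyond it.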

    >> I.e, the 6-byte combinations
    >> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> where the first x = 1.
    >> With a full 32-bit encoding, one can also use UTF-8 to encode binary data.
    >No, one cannot. Using UTF-8 to encode binary data is horribly
    >non-conformant. UTF-8 is not a representation for binary data,
    >it is a character encoding form for *encoded characters* in
    >Unicode/10646, defined on the same range of code points as

    I will describe my motivations in more detail below.

    >> It also simplifies somewhat the implementation of Unicode in lexer
    >> generators (such as Flex): The leading byte then covers all 256
    >> combinations. All 2^32 numbers should probably be there for generating
    >> proper lexer error messages.
    >There is nothing preventing your lexer implementation from simply
    >putting UTF-32 data into 32-bit registers as 32-bit unsigned
    >integers. Everybody does that. And all 32-bit integral values are
    >there for generating whatever lexer error messages you want.
    >But the valid *character* values are only U+0000..U+D7FF, U+E000..
    >U+10FFFF. If your lexer runs into some other numerical value,
    >then it is dealing with bogus data that cannot be interpreted
    >as a Unicode character.

    I am not really speaking about implementing a lexer directly, but about
    implementing lexers by means of a lexer (scanner) generator such as
    Flex. It is not known, at least to the Flex group, how to extend Flex
    to Unicode. Using UTF-32 and 32-bit input directly would require that
    one implement tables over the full 32-bit range.

    So last week I thought that one could instead translate UTF-8 and
    UTF-32BE/LE regular expressions into 1-byte regular expressions, and
    then let Flex expand those. I wrote some functions in Haskell using
    Hugs doing just this. For details, see
      List-Info: <>
      List-Archive: <>
    Spin-offs of this method are that different encodings can be mixed in a
    single lexer, and that the big/little-endian issue is resolved.
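
    The first step of such a translation can be sketched as follows (a
    hedged illustration of the general idea, not the actual Hugs code
    referred to above): a code-point range is split at the UTF-8
    encoding-length boundaries, so that each resulting subrange encodes to
    byte sequences of one fixed length and can then be expanded byte-wise
    into 1-byte regular expressions.

```haskell
-- Split an inclusive code-point range at the UTF-8 encoding-length
-- boundaries. Every code point in a resulting subrange encodes to the
-- same number of bytes, which is the precondition for expanding the
-- subrange byte-wise into 1-byte regular expressions for Flex.
-- (Boundaries shown for standard 21-bit UTF-8; the five- and six-byte
-- rows of the extended scheme would be added analogously.)
lengthSplit :: Int -> Int -> [(Int, Int)]
lengthSplit lo hi =
  [ (max lo a, min hi b)
  | (a, b) <- [ (0x0,     0x7F)      -- 1-byte sequences
              , (0x80,    0x7FF)     -- 2-byte sequences
              , (0x800,   0xFFFF)    -- 3-byte sequences
              , (0x10000, 0x10FFFF)  -- 4-byte sequences
              ]
  , max lo a <= min hi b ]
```

    Within each fixed-length piece, one then recurses over byte positions,
    splitting further wherever the endpoints' byte sequences diverge.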

    Then, when using Flex to write a Unicode lexer, one wants to cover all
    combinations, including the illegal ones. This includes not only the
    full 32 bits, but also the overlong and otherwise invalid sequences.
    When writing a Unicode scanner, one will of course just let the lexer
    generate some error message. Some other scanner may want to attempt to
    resynchronize. So from this general standpoint it is more convenient
    that Flex does not make any decision about what to do with those
    values.
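
    Resynchronization of the kind mentioned above is commonly done by
    skipping forward to the next byte that is not a continuation byte; a
    small sketch (the function names are mine):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- A UTF-8 continuation byte has the form 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- After consuming an invalid lead byte or a truncated sequence, drop
-- bytes until the next plausible sequence start, so lexing can resume.
resync :: [Word8] -> [Word8]
resync = dropWhile isContinuation
```

    A lexer that instead just wants an error message would report the
    offending bytes rather than call anything like this.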

    Of course, Flex can simply do what it wants in those cases. But it is
    good to have some general conventions for those values as well.

      Hans Aberg

    This archive was generated by hypermail 2.1.5 : Mon Jan 17 2005 - 18:54:21 CST