RE: 32'nd bit & UTF-8

From: Hans Aberg
Date: Mon Jan 17 2005 - 18:19:36 CST


    At 18:47 +0000 2005/01/17, Jon Hanna wrote:
    >> Are there any good reasons for UTF-32 to exclude the 32'nd
    >> bit of an encoded
    >> 4-byte? I.e, the 6-byte combinations
    >> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    >> where the first x = 1.
    >Because there is no such character. Even the 5 and 6-octet combinations
    >allowable by ISO 10646 won't identify a Unicode character (or an assigned
    >ISO 10646 character).

    See my reply to Kenneth Whistler, which explains my deliberate sloppiness on
    this point, as well as my motivations.

    >> With a full 32-bit encoding, one can also use UTF-8 to
    >> encode binary data.
    >I really find it hard to see the advantage to this.

    One advantage of this approach is that it is insensitive to byte order (the
    big/little-endian issue), so it could be used for that as well.
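
    To make the byte layout concrete, here is a minimal Python sketch of the
    extension under discussion. It assumes the one- to five-byte forms of the
    original (pre-RFC 3629) UTF-8 layout, plus a six-byte form whose lead byte
    is 111111xx, so that 2 + 5*6 = 32 payload bits fit. The names encode_ext
    and decode_ext are made up for illustration, the output is not valid UTF-8
    under the current standard, and overlong forms are not rejected.

```python
def encode_ext(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in the hypothetical extended layout."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    if value < 0x80:
        return bytes([value])  # single byte: 0xxxxxxx
    # (exclusive upper bound, continuation bytes, lead-byte prefix)
    for limit, ncont, prefix in (
        (0x800, 1, 0xC0),        # 110xxxxx
        (0x10000, 2, 0xE0),      # 1110xxxx
        (0x200000, 3, 0xF0),     # 11110xxx
        (0x4000000, 4, 0xF8),    # 111110xx
        (0x100000000, 5, 0xFC),  # 111111xx (assumed extension, 2 payload bits)
    ):
        if value < limit:
            break
    lead = prefix | (value >> (6 * ncont))
    # Continuation bytes carry 6 bits each, most significant group first.
    tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(ncont - 1, -1, -1)]
    return bytes([lead] + tail)


# (lead select mask, lead prefix, continuation bytes, lead payload mask)
_FORMS = (
    (0x80, 0x00, 0, 0x7F),
    (0xE0, 0xC0, 1, 0x1F),
    (0xF0, 0xE0, 2, 0x0F),
    (0xF8, 0xF0, 3, 0x07),
    (0xFC, 0xF8, 4, 0x03),
    (0xFC, 0xFC, 5, 0x03),
)


def decode_ext(data: bytes) -> int:
    """Decode exactly one sequence produced by encode_ext."""
    lead = data[0]
    for select, prefix, ncont, payload in _FORMS:
        if lead & select == prefix:
            break
    else:
        raise ValueError("sequence starts with a continuation byte")
    if len(data) != ncont + 1:
        raise ValueError("wrong sequence length")
    value = lead & payload
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte (10xxxxxx)")
        value = (value << 6) | (b & 0x3F)
    return value
```

    The byte sequence is defined one octet at a time, so the same value yields
    the same bytes on every machine. By contrast, a raw 4-byte integer such as
    0x12345678 is stored as 12 34 56 78 or as 78 56 34 12 depending on the
    byte order in use.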

    >> It also simplifies somewhat the implementation of Unicode in lexer
    >> generators (such as Flex):
    >Not as much as basing the lexer on characters rather than octets does.

    Again, more details are given in my reply to Kenneth Whistler. The question
    is not how to implement a specific Unicode lexer, but how to build Unicode
    support into a lexer generator such as Flex. Working directly with Unicode
    code points would require table compression algorithms, and the lexer
    would still have to choose a Unicode encoding for its input.
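
    As a rough illustration of the byte-oriented alternative: a transition
    table indexed by byte needs 256 entries per state, whereas one indexed
    directly by code point would need 0x110000 = 1,114,112 entries per state
    before compression. The Python sketch below expands a literal into the
    UTF-8 byte escapes that a byte-based scanner could match. The helper name
    utf8_byte_pattern is hypothetical, and the \xHH escape syntax is assumed
    to be what the target generator accepts.

```python
def utf8_byte_pattern(text: str) -> str:
    """Return text as a run of \\xHH escapes, one per UTF-8 byte."""
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))


if __name__ == "__main__":
    # The euro sign U+20AC occupies three bytes in UTF-8.
    print(utf8_byte_pattern("\u20ac"))
```

    Character ranges are harder than literals, because one range of code
    points generally splits into several byte-sequence alternatives. This
    sketch does not attempt that.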

      Hans Aberg
