RE: 32'nd bit & UTF-8

From: Jon Hanna (jon@hackcraft.net)
Date: Mon Jan 17 2005 - 12:47:31 CST

    > Are there any good reasons for UTF-8 to exclude the 32nd
    > bit of an encoded 4-byte value? I.e., the 6-byte combinations
    > 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > where the first x = 1.

    Because there is no such character. Even the 5- and 6-octet sequences
    that ISO 10646 allows do not identify a Unicode character (or an assigned
    ISO 10646 character).
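
    To make the bit arithmetic concrete, here is a minimal Python sketch.
    The names encode_extended6 and is_unicode_scalar are hypothetical,
    invented for illustration. The 6-octet layout is the nonstandard one
    from the quoted question (2 payload bits in the lead octet plus 5 x 6
    in the trailing octets, 32 bits in total); it is not a form defined by
    Unicode, which stops at U+10FFFF.

```python
MAX_UNICODE = 0x10FFFF  # highest Unicode code point


def encode_extended6(value: int) -> bytes:
    """Pack a 32-bit value into the nonstandard 6-octet layout
    111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper: this layout is an assumption taken from the
    question, not part of UTF-8 as specified.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    lead = 0xFC | (value >> 30)  # 111111xx carries the top 2 bits
    tail = [0x80 | ((value >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead] + tail)


def is_unicode_scalar(value: int) -> bool:
    """True if value is a code point in range that is not a surrogate."""
    return 0 <= value <= MAX_UNICODE and not 0xD800 <= value <= 0xDFFF
```

    With the 32nd bit set, the lead octet becomes 0xFE or 0xFF, the payload
    fails the is_unicode_scalar check, and Python's own strict decoder
    rejects the octets: encode_extended6(0x80000000).decode("utf-8") raises
    UnicodeDecodeError. The highest real code point, U+10FFFF, needs only
    4 octets.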

    > With a full 32-bit encoding, one can also use UTF-8 to
    > encode binary data.

    I really find it hard to see the advantage to this.

    > It also simplifies somewhat the implementation of Unicode in lexer
    > generators (such as Flex):

    Not as much as basing the lexer on characters rather than octets does.
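
    As a rough illustration of that point, the sketch below decodes the
    octets to characters first and only then tokenizes. It is written in
    Python rather than Flex, and the token classes and the name tokenize
    are assumptions made up for the example, not anything from this thread.

```python
import re

# Token classes are illustrative only: identifiers, integers, and
# single-character operators. The pattern works on characters, so a
# letter such as "é" is one unit regardless of its UTF-8 length.
TOKEN = re.compile(r"(?P<ident>[^\W\d]\w*)|(?P<number>\d+)|(?P<op>[^\w\s])")


def tokenize(data: bytes) -> list[tuple[str, str]]:
    """Decode UTF-8 strictly, then lex the resulting characters."""
    text = data.decode("utf-8")  # malformed octets raise UnicodeDecodeError
    # finditer skips characters no alternative matches (here, whitespace)
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
```

    Because decoding happens once, up front, the lexer rules never mention
    lead or continuation octets, and ill-formed input is rejected before
    any rule runs.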

    Regards,
    Jon Hanna
    Work: <http://www.selkieweb.com/>
    Play: <http://www.hackcraft.net/>
    Chat: <irc://irc.freenode.net/selkie>


