Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 18:09:32 CST

    On 2005/01/18 21:21, Jon Hanna at jon@hackcraft.net wrote:

    >> According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>,
    >> UTF is short for
    >> UCS Transformation Format, where UCS stands for Universal
    >> Character Set.
    >
    > Minor incidences of a website being out of date aren't really relevant here
    > unless the website is unicode.org.

    This is quite hard to interpret.

    >> The Unicode standard is like Big Brother in George Orwell's
    >> "1984", making
    >> it possible to only speak about what is right, but not what
    >> is wrong. The
    >> lexer generator needs to be able to speak about what is wrong
    >> as well, in
    >> order to give proper handling to that.
    >
    > Let me get this straight:
    > You are processing UTF-8.
    > You want to find errors.
    > When you find errors the text wasn't UTF-8.
    > Therefore you want the erroneous sequences to be allowed.
    > In which case they won't be errors.
    > And then you can find the errors.
    > That aren't there any more.
    > This is beginning to feel like an Escher painting.

    A lexer generator like Flex does not process Unicode directly; it generates
    a lexer that processes bytes. The question is how to get it to emulate
    a Unicode lexer, if the person who writes the input grammar so wishes. Then
    it is up to the lexer writer, not the developer of the Flex program, to decide
    what happens in the case of illegal Unicode numbers. The best way to get a
    feel for it is to start using it. Flex is used in connection with the parser
    generator Bison. Info is found at GNU <http://gnu.org>. Also check out the
    Usenet newsgroup comp.compilers.
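To make the point concrete, here is a rough sketch (in Python, not Flex input) of a byte-level scanner that can "speak about what is wrong": it labels each byte sequence either as a well-formed scalar value or with the kind of error it found, and leaves the policy to the caller. The function name `scan_utf8` and the error labels are hypothetical, chosen only for this illustration; a generated Flex lexer would express the same byte patterns as regular expressions instead.

```python
def scan_utf8(data: bytes):
    """Yield (kind, value, length) for each unit found in data.

    kind is "scalar" for a well-formed Unicode scalar value; any other
    label names what is wrong. The caller decides how to handle it.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # Single-byte (ASCII) case.
            yield ("scalar", b0, 1)
            i += 1
            continue
        # Lead byte -> (sequence length, payload bits, smallest value
        # that genuinely needs a sequence of that length).
        if 0xC0 <= b0 <= 0xDF:
            length, value, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, value, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            length, value, minimum = 4, b0 & 0x07, 0x10000
        else:
            # 0x80..0xBF is a stray continuation byte; 0xF8..0xFF would
            # start a 5- or 6-byte form, which this sketch does not accept.
            yield ("bad-lead-byte", b0, 1)
            i += 1
            continue
        tail = data[i + 1:i + length]
        if len(tail) < length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            # Missing or malformed continuation bytes: report the lead
            # byte alone and resume scanning at the next byte.
            yield ("incomplete", b0, 1)
            i += 1
            continue
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            kind = "overlong"
        elif 0xD800 <= value <= 0xDFFF:
            kind = "surrogate"
        elif value > 0x10FFFF:
            kind = "out-of-range"
        else:
            kind = "scalar"
        yield (kind, value, length)
        i += length
```

A lexer writer could then map, say, "overlong" to a hard error and "out-of-range" to a warning token; that choice is exactly the policy that stays outside the generator.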

    >> Besides, even though Unicode has declared to never use more
    >> than 21 bits, in
    >> the track record, Unicode has reneged on such promises. It
    >> might be prudent
    >> to knock down a full 32-bit encoding, declaring UTF-8/32 to
    >> be subsets of
    >> that.
    >
    > Why should we cater for the "a full 32-bit encoding". Piffle to you and your
    > obsolete technology, demand full 64-bit encoding now and get 128-bit on the
    > roadmap! My choice of unit-size arbitrarily based on particular processor
    > capabilities is so much cooler than yours.

    32 bits is the smallest natural computer alignment for 21 bits, and will
    therefore have to be dealt with anyhow when writing a lexer.
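The arithmetic behind that remark can be checked in a few lines. This is only an illustration: the helper name `smallest_natural_width` and the list of candidate widths are assumptions made for the example, not anything defined by Unicode or Flex.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

def smallest_natural_width(bits: int) -> int:
    """Return the smallest common integer width (in bits) holding `bits` bits."""
    for width in (8, 16, 32, 64):
        if bits <= width:
            return width
    raise ValueError("value too wide for the widths considered here")

bits_needed = MAX_CODE_POINT.bit_length()        # 21 bits
unit_width = smallest_natural_width(bits_needed)  # 32 bits
spare_bits = unit_width - bits_needed             # 11 bits left unused
```

A lexer that stores code points in such a 32-bit unit therefore also has to decide what to do with values that use the 11 spare bits.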

      Hans Aberg


