RE: 32'nd bit & UTF-8

From: Jon Hanna (jon@hackcraft.net)
Date: Tue Jan 18 2005 - 14:21:05 CST


    > According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, UTF is short for
    > UCS Transformation Format, where UCS stands for Universal Character Set.

    Minor instances of a website being out of date aren't really relevant here
    unless the website is unicode.org.

    > The Unicode standard is like Big Brother in George Orwell's "1984", making
    > it possible to only speak about what is right, but not what is wrong. The
    > lexer generator needs to be able to speak about what is wrong as well, in
    > order to give proper handling to that.

    Let me get this straight:
    You are processing UTF-8.
    You want to find errors.
    When you find errors the text wasn't UTF-8.
    Therefore you want the erroneous sequences to be allowed.
    In which case they won't be errors.
    And then you can find the errors.
    That aren't there any more.
    This is beginning to feel like an Escher painting.
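
    For what it's worth, a processor can talk about what is wrong without
    declaring it right. The following is only a minimal sketch, assuming Python
    and its built-in strict "utf-8" codec; the function name find_utf8_errors is
    made up for illustration. It reports the offset and bytes of every
    ill-formed sequence while still treating each one as an error:

```python
def find_utf8_errors(data: bytes):
    """Return (offset, offending_bytes) for each ill-formed sequence in data.

    Hypothetical helper for illustration: the input is never accepted as
    valid UTF-8 unless this list comes back empty.
    """
    errors = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding: any ill-formed sequence raises an exception.
            data[pos:].decode("utf-8", errors="strict")
            break  # the remainder is well-formed, nothing more to report
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            errors.append((start, data[start:end]))
            pos = end  # resume scanning after the offending bytes
    return errors


if __name__ == "__main__":
    # 0xC0 0xAF is an overlong form of "/", which strict UTF-8 rejects.
    for offset, bad in find_utf8_errors(b"ok \xc0\xaf end"):
        print(offset, bad.hex())
```

    The old five-byte forms are reported the same way, byte by byte, so a lexer
    built on something like this can diagnose them without anyone having to
    redefine UTF-8 to include them.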

    > Besides, even though Unicode has declared to never use more than 21 bits, in
    > the track record, Unicode has reneged on such promises. It might be prudent
    > to knock down a full 32-bit encoding, declaring UTF-8/32 to be subsets of
    > that.

    Why should we cater for "a full 32-bit encoding"? Piffle to you and your
    obsolete technology, demand full 64-bit encoding now and get 128-bit on the
    roadmap! My choice of unit-size arbitrarily based on particular processor
    capabilities is so much cooler than yours.


