Re: UTF-8 stress test

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Oct 10 2004 - 16:03:48 CST

    It gets worse with the file at:
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

    'According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
    receiving UTF-8 shall interpret a "malformed sequence in the same way
    that it interprets a character that is outside the adopted subset"'

    That behaviour is clearly out of date. For security reasons, Unicode
    has since tightened its conformance rules: ill-formed UTF-8 must not
    be interpreted as characters. The text should be rejected instead, OR
    the malformed UTF-8 should be repaired on loading so that it becomes
    conforming UTF-8, either by stripping out the bad bytes or by
    replacing them.

    As long as we never pass invalid UTF-8 to client apps/code and never
    process invalid UTF-8 ourselves, we are fine. So modifying the bytes
    of the UTF-8 text before doing anything else with it can work in some
    circumstances.
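
    To make the three options concrete, here is a rough Python sketch. It
    is only an illustration: load_utf8 and its policy names are made up
    for this example, and it leans on Python's built-in strict UTF-8
    decoder (which refuses overlong forms and other malformed sequences)
    rather than on any particular library's validator.

```python
def load_utf8(raw: bytes, policy: str = "reject") -> str:
    """Decode raw bytes as UTF-8 at the loading boundary.

    Hypothetical helper for illustration. Policies:
      "reject"  - raise UnicodeDecodeError on any malformed sequence
      "strip"   - silently drop the offending bytes
      "replace" - substitute U+FFFD for each offending byte sequence
    """
    # Map the illustrative policy names onto Python's codec error handlers.
    handlers = {"reject": "strict", "strip": "ignore", "replace": "replace"}
    return raw.decode("utf-8", errors=handlers[policy])


if __name__ == "__main__":
    # 0xC0 0xAF is an overlong (non-shortest-form) encoding of "/".
    suspect = b"ok\xc0\xafend"

    try:
        load_utf8(suspect, "reject")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)

    print(repr(load_utf8(suspect, "strip")))
    print(repr(load_utf8(suspect, "replace")))

    # Re-encoding the repaired text yields conforming UTF-8, which is
    # what would be handed on to client code.
    clean = load_utf8(suspect, "replace").encode("utf-8")
    assert load_utf8(clean, "reject") == load_utf8(suspect, "replace")
```

    In none of the three cases is the overlong sequence decoded to "/",
    which is the point of the stricter rules. Stripping is the riskiest
    of the repairs, since deleting bytes can join text that was meant to
    be separate, so replacing is usually the safer default.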

    --
         Theodore H. Smith - Software Developer.
         http://www.elfdata.com
    

