Re: UTF-8 stress test

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Oct 10 2004 - 16:03:48 CST

Previous message: Theodore H. Smith: "Re: UTF-8 stress test file?"

It gets worse with the file at:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

' According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" '

That behaviour is clearly out of date. Unicode has since tightened its
conformance rules for security reasons, so a decoder must no longer
interpret malformed sequences such as overlong forms. The text should
be rejected instead, OR the malformed UTF-8 should be modified upon
loading to make it conforming UTF-8, basically by stripping out or
replacing the bad bytes.
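
To make the three options concrete, here is a rough sketch in Python
using the standard decoder's error handlers. The sample bytes and the
function names are my own illustration, not taken from the test file;
the sample is "ab", then an overlong encoding of "/" (0xC0 0xAF), then
"c".

```python
# Hypothetical sample input: "ab" + overlong "/" (0xC0 0xAF) + "c".
SAMPLE = b"ab\xc0\xafc"


def reject(raw: bytes) -> str:
    """Option 1: refuse the whole text if any sequence is malformed."""
    # errors="strict" raises UnicodeDecodeError on the first bad byte.
    return raw.decode("utf-8", errors="strict")


def strip_bad(raw: bytes) -> str:
    """Option 2: silently drop the bytes that cannot be decoded."""
    return raw.decode("utf-8", errors="ignore")


def replace_bad(raw: bytes) -> str:
    """Option 3: substitute U+FFFD for the bytes that cannot be decoded."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        reject(SAMPLE)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
    print(repr(strip_bad(SAMPLE)))
    print(repr(replace_bad(SAMPLE)))
```

Replacing is usually the safer of the two repair options, since it
leaves a visible marker where the bad bytes were instead of joining the
text on either side of them.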

As long as we don't pass any invalid UTF-8 to client apps/code, and we
don't process any invalid UTF-8 ourselves, we are fine. So modifying the
bytes of the UTF-8 text before doing anything else with it can work in
some circumstances.
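
One way to picture "fix it once at the door" is a small loader that
every input goes through before any other code sees it. This is only a
sketch under my own assumptions: the name load_utf8 and the choice of
U+FFFD replacement (rather than stripping or rejecting) are
illustrative, not a prescribed method.

```python
from typing import Tuple


def load_utf8(raw: bytes) -> Tuple[bytes, bool]:
    """Return (conforming UTF-8 bytes, was_modified).

    Hypothetical boundary function: malformed sequences are replaced
    with U+FFFD, so callers downstream only ever see valid UTF-8.
    """
    # Decode leniently, then re-encode; the result is always well-formed.
    text = raw.decode("utf-8", errors="replace")
    clean = text.encode("utf-8")
    # If the round trip changed the bytes, the input was not conforming.
    return clean, clean != raw


if __name__ == "__main__":
    clean, modified = load_utf8(b"ok\xffend")
    print(clean, modified)
```

The was_modified flag lets the caller decide whether a repaired text is
acceptable or should be rejected after all, which covers the
"in some circumstances" part.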

--
     Theodore H. Smith - Software Developer.
     http://www.elfdata.com
