From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Oct 10 2004 - 16:03:48 CST
It gets worse with the file at:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
' According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" '
That behaviour is clearly out of date. Unicode has since tightened its
conformance rules for security reasons. The text should be rejected
instead, or the malformed UTF-8 should be repaired on loading so that it
becomes conforming UTF-8, basically by stripping out or replacing the
bad bytes.
As long as we don't pass any invalid UTF-8 to client apps/code, and we
don't process any invalid UTF-8 ourselves, we are fine. So modifying the
bytes of the UTF-8 text before doing anything else with it can work in
some circumstances.
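
To make the three options concrete (reject, strip, replace), here is a
minimal sketch in Python. The function name load_utf8 and its policy
argument are made up for illustration. It relies on Python 3's built-in
"utf-8" codec being strict about overlong forms, surrogates and stray
continuation bytes. Take it as one possible shape for a loader, not as
the way any particular library does it.

```python
def load_utf8(raw: bytes, policy: str = "reject") -> str:
    """Decode bytes at load time so that only valid text goes further.

    policy is a hypothetical parameter for this sketch:
      "reject"  - raise UnicodeDecodeError on any malformed sequence
      "strip"   - silently drop the bad bytes
      "replace" - substitute U+FFFD for the bad bytes
    """
    if policy == "reject":
        return raw.decode("utf-8")  # strict is the default error mode
    if policy == "strip":
        return raw.decode("utf-8", errors="ignore")
    if policy == "replace":
        return raw.decode("utf-8", errors="replace")
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # 0xC0 0xAF is an overlong (non-shortest-form) encoding of "/".
    bad = b"a\xc0\xafb"
    print(repr(load_utf8(bad, "strip")))
    print(repr(load_utf8(bad, "replace")))
    try:
        load_utf8(bad, "reject")
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

Whichever policy is used, the point is that the overlong "/" never comes
out the other side as a real "/", so nothing downstream sees it. Note
that stripping can join the text on either side of the bad bytes, which
is one reason it only works in some circumstances.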
-- Theodore H. Smith - Software Developer. http://www.elfdata.com