Re: UTF-8 stress test file?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 11 2004 - 08:16:22 CST

Next message: Chris Jacobs: "Re: bit notation in ISO-8859-x is wrong"

Previous message: James Kass: "Re: bit notation in ISO-8859-x is wrong"
In reply to: Terje Bless: "Re: UTF-8 stress test file?"
Next in thread: Theodore H. Smith: "Re: UTF-8 stress test file?"
Reply: Theodore H. Smith: "Re: UTF-8 stress test file?"

From: "Terje Bless" <link@pobox.com>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Theodore H. Smith <delete@elfdata.com> wrote:
>
>>I'd like to see a UTF-8 stress test file.
>
> The top result on Google for the query “UTF-8 Stress Test” is
> <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.

This test file is out of date and incorrect. It says "Unicode" where it
really means the old RFC definition of UTF-8 referenced by previous versions
of ISO/IEC 10646. Under the current definition, all the UTF-8 sequences of 5
bytes or more in that file are invalid (they are not "boundary cases"), so
the list of "impossible bytes" is longer than documented there.
The more exact definition of UTF-8, now shared by Unicode and by the current
version of ISO/IEC 10646, is documented in the conformance section of the
Unicode standard.
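
To make the point about "impossible bytes" concrete, here is a minimal
Python sketch. It assumes Python 3's built-in "utf-8" codec as the validator;
the names IMPOSSIBLE_BYTES, FIVE_BYTE, SIX_BYTE and is_valid_utf8 are mine,
chosen for illustration, and are not taken from the test file or from the
standard. The two sample sequences are the old-style 5-byte and 6-byte lower
boundary forms.

```python
# Bytes that never occur in UTF-8 as currently defined:
# 0xC0 and 0xC1 could only start overlong 2-byte forms, and
# 0xF5..0xFF would start sequences beyond U+10FFFF or of 5+ bytes.
IMPOSSIBLE_BYTES = sorted({0xC0, 0xC1} | set(range(0xF5, 0x100)))

# Old-style long forms (illustrative samples).
FIVE_BYTE = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
SIX_BYTE = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(" ".join(f"{b:02X}" for b in IMPOSSIBLE_BYTES))
    print(is_valid_utf8(FIVE_BYTE), is_valid_utf8(SIX_BYTE))
```

Under this reading, a decoder has to reject 13 byte values outright, not only
FE and FF.
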
Still, this file is useful for checking whether your browser or editor
actually shows a substitute (like "?") wherever it should for each invalid
sequence. But if your browser simply reports that this is not a UTF-8 encoded
file and does not display it at all, it will be right:
- the file mixes UTF-8 and UTF-16;
- invalid sequences may raise an exception that informs the user that the
file can't be decoded;
- a browser or text editor may just as well trigger its
charset-autodetection mechanism to try to find another charset. If the file
is then displayed as ISO-8859-1, with each byte of the UTF-8 or UTF-16
sequences shown as if it were an ISO-8859-1 character, that is not a
conformance problem for the browser or text editor.
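
The three behaviours described above (showing substitutes, refusing the
file, falling back to another charset) can be sketched in Python as follows.
This is only an illustration under my own assumptions: the function names are
hypothetical, Python substitutes U+FFFD rather than "?", and the fallback
simply tries ISO-8859-1 instead of doing any real charset autodetection.

```python
def decode_with_substitutes(data: bytes) -> str:
    """Show a substitute character (U+FFFD) for each invalid sequence."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> str:
    """Refuse the whole input: raises UnicodeDecodeError if it is not UTF-8."""
    return data.decode("utf-8")


def decode_with_fallback(data: bytes) -> str:
    """Try UTF-8 first; if that fails, show every byte as ISO-8859-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


if __name__ == "__main__":
    sample = b"ab\xffcd"  # 0xFF can never appear in UTF-8
    print(repr(decode_with_substitutes(sample)))
    print(repr(decode_with_fallback(sample)))
    try:
        decode_strict(sample)
    except UnicodeDecodeError as err:
        print("not UTF-8:", err.reason)
```

All three functions treat the input consistently, which is the point: none
of them passes an invalid sequence through as if it were valid UTF-8.
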

This archive was generated by hypermail 2.1.5 : Mon Oct 11 2004 - 08:21:30 CST