Re: UTF-8 stress test file?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 11 2004 - 08:16:22 CST


    From: "Terje Bless" <link@pobox.com>
    > -----BEGIN PGP SIGNED MESSAGE-----
    > Hash: SHA1
    >
    > Theodore H. Smith <delete@elfdata.com> wrote:
    >
    >>I'd like to see a UTF-8 stress test file.
    >
    > The top result on Google for the query “UTF-8 Stress Test” is
    > <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.

    This test file is out of date and incorrect: it says "Unicode" where it
    should refer to the old RFC definition of UTF-8 referenced by previous
    versions of ISO/IEC 10646. Under the current definition, all the UTF-8
    sequences of 5 bytes or more in that file are invalid (they are not
    "boundary cases"), so the list of "impossible bytes" is longer than
    documented there.
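
    As a rough check of the "impossible bytes" point, here is a small Python
    sketch (not from the original post; the function name is made up for
    illustration). It encodes every Unicode scalar value and collects the byte
    values that never occur in the result. Under the current definition this
    should give C0, C1 and F5 through FF, where the older RFC text names only
    FE and FF.

```python
def impossible_utf8_bytes():
    """Return the byte values that never occur in well-formed UTF-8.

    Brute force: encode every Unicode scalar value and record which
    byte values show up. Hypothetical helper, for illustration only.
    """
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not scalar values
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


if __name__ == "__main__":
    print(" ".join(f"{b:02X}" for b in impossible_utf8_bytes()))
```

    The loop visits about 1.1 million code points, so it takes a moment to run;
    a table of lead-byte ranges would be faster but less self-evidently right.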
    The more exact definition of UTF-8, now shared by Unicode and by the
    current version of ISO/IEC 10646, is documented in the conformance section
    of the Unicode Standard.
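
    For illustration, a minimal validator along the lines of the well-formed
    byte sequence table in that conformance section might look like the Python
    sketch below. The table rows are transcribed from memory of the standard,
    and the names _ROWS and is_well_formed_utf8 are hypothetical, so treat it
    as a sketch rather than a reference implementation.

```python
# Each row: (lead-byte range, allowed range for the second byte, length).
# Any third or fourth byte must be a plain continuation byte 80..BF.
_ROWS = [
    ((0xC2, 0xDF), (0x80, 0xBF), 2),
    ((0xE0, 0xE0), (0xA0, 0xBF), 3),  # excludes overlong 3-byte forms
    ((0xE1, 0xEC), (0x80, 0xBF), 3),
    ((0xED, 0xED), (0x80, 0x9F), 3),  # excludes surrogates D800..DFFF
    ((0xEE, 0xEF), (0x80, 0xBF), 3),
    ((0xF0, 0xF0), (0x90, 0xBF), 4),  # excludes overlong 4-byte forms
    ((0xF1, 0xF3), (0x80, 0xBF), 4),
    ((0xF4, 0xF4), (0x80, 0x8F), 4),  # excludes values above U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of well-formed UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:  # ASCII, one byte
            i += 1
            continue
        for (lo, hi), (s_lo, s_hi), length in _ROWS:
            if lo <= lead <= hi:
                break
        else:
            # C0, C1, F5..FF, or a continuation byte in lead position
            return False
        if i + length > n:
            return False  # sequence is truncated
        if not s_lo <= data[i + 1] <= s_hi:
            return False
        for b in data[i + 2 : i + length]:
            if not 0x80 <= b <= 0xBF:
                return False
        i += length
    return True
```

    With this, a 5-byte sequence such as F8 88 80 80 80 is rejected at its
    first byte, since F8 matches no row of the table.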
    Still, this file is useful for checking whether your browser or editor
    actually shows substitutes (like "?") for all invalid sequences, as it
    should. But if your browser does not display the file at all and just says
    that it is not UTF-8 encoded, it is also right:
    - the file mixes UTF-8 and UTF-16
    - invalid sequences may raise an exception that informs the user that the
    file can't be decoded.
    - a browser or text editor may just as well trigger its
    charset-autodetection mechanism to look for another charset. If the file
    is then displayed as ISO-8859-1, with each byte of the UTF-8 or UTF-16
    sequences shown as an ISO-8859-1 character, that is not a conformance
    problem for the browser or text editor.
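
    The three behaviours described above (show substitutes, refuse to decode,
    fall back to another charset) can be sketched in Python as follows. The
    function name, the policy names and the sample bytes are assumptions made
    for this example; Python's decoder substitutes U+FFFD rather than "?".

```python
def decode_for_display(data: bytes, policy: str = "replace") -> str:
    """Decode bytes that claim to be UTF-8, using one of three policies.

    Hypothetical helper for illustration:
      "replace"  - show a substitute (U+FFFD) for each invalid sequence
      "strict"   - raise UnicodeDecodeError so the caller can tell the user
      "fallback" - reinterpret the whole input as ISO-8859-1 on failure
    """
    if policy == "replace":
        return data.decode("utf-8", errors="replace")
    if policy == "strict":
        return data.decode("utf-8")
    if policy == "fallback":
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so this cannot fail.
            return data.decode("iso-8859-1")
    raise ValueError(f"unknown policy: {policy!r}")


# Sample input: C0 AF is an overlong (hence invalid) encoding of "/".
SAMPLE = b"ok \xc0\xaf end"

if __name__ == "__main__":
    print(repr(decode_for_display(SAMPLE, "replace")))
    print(repr(decode_for_display(SAMPLE, "fallback")))
```

    Only the first policy exercises what the test file is meant to check; the
    other two never reach the point of displaying substitutes.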


