Re: UTF-8 stress test file?

From: Theodore H. Smith
Date: Sun Oct 10 2004 - 15:59:25 CST

    >> I'd like to see a UTF-8 stress test file.
    >> It should consist of lines of UTF-8, separated each by a newline.
    >> Each line should be malformed. Also, some idea of how to deal with
    >> the malformed UTF-8 should be noted in a separate file.
    >> Really, I just want some way to verify that I can detect every kind
    >> of UTF-8 wrongness. I have some code I adapted from, but
    >> I want to make sure my adaptions haven't broken the code.

    "This file is not meant to be a conformance test. It does
    not prescribes any particular outcome and therefore there is no way to
    "pass" or "fail" this test file, even though the texts suggests a
    preferable decoder behaviour at some places."

    I'm wondering if has a proper conformance test? If not, I
    suggest they make one. One where we had each test separated by a single
    newline, and no non-test lines existing... unless they wanted to make
    some kind of "comment line" which is easy to parse (let's say starting
    the line with "#").

    For me to use that test programmatically, I'll need to break out my
    non-UTF-8-aware text editor, delete all the non-test lines, and then
    separate out the good and the bad UTF-8 into different files! That way I
    can use readline-type code to do my UTF-8 verification.
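
    A minimal sketch of that kind of line-by-line check, assuming a test
    file with one case per line and "#" comment lines as suggested above.
    The function name classify_lines is hypothetical, and Python's strict
    built-in UTF-8 decoder stands in for whatever decoder is actually
    under test:

```python
def classify_lines(data: bytes):
    """Split raw test-file bytes into (good, bad) lists of lines.

    Assumed format (not an existing standard): one test case per line,
    lines separated by a single newline (0x0A), and lines starting with
    "#" treated as comments and skipped.
    """
    good, bad = [], []
    for line in data.split(b"\n"):
        # Skip empty lines and comment lines.
        if not line or line.startswith(b"#"):
            continue
        try:
            # Strict decoding raises on any malformed sequence
            # (overlong forms, encoded surrogates, truncated sequences).
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(line)
        else:
            good.append(line)
    return good, bad
```

    Calling classify_lines(open(path, "rb").read()) on such a file yields
    the two lists, which can then be written out as separate "good" and
    "bad" files. The file has to be opened in binary mode, since reading
    it as text would fail or silently repair the malformed lines before
    they can be checked.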

    It would be nice if someone had an "automated-test-ready" UTF-8 file.

    If not, I'll modify this one and then put the results up on my website
    someday (in a week or so).

         Theodore H. Smith - Software Developer.
