Re: UTF-8 stress test file?

From: Philipp Reichmuth
Date: Tue Oct 12 2004 - 16:59:30 CST

    Philippe Verdy wrote:
    > Examples of bad assumptions that a reader could make:
    > - [quote](...) Experience so far suggests
    > that most first-time authors of UTF-8 decoders find at least one
    > serious problem in their decoder by using this file.[/quote]
    > This suggests to the reader that if its browser or editor does not
    > display the contained test text as indicated, there's a problem in that
    > application.

    Well, to me it didn't. After all, the purpose of this file is to be a
    stress test for UTF-8 decoders, as indicated in line 1. By testing
    their decoders on this file, UTF-8 decoder authors tend to find problems
    of some kind in their programs. So where is the problem again?
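
    As a rough illustration of the kind of problem such a test is meant
    to surface, here is a minimal sketch in Python 3. It relies on the
    built-in strict "utf-8" codec; the helper name is_valid_utf8 and the
    sample byte strings are assumptions chosen for illustration, not
    taken from the test file itself.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative inputs (assumed for this sketch, not copied from the file).
samples = {
    "well-formed Greek text": "\u03ba\u03cc\u03c3\u03bc\u03b5".encode("utf-8"),
    "overlong encoding of '/'": b"\xc0\xaf",
    "lone continuation byte": b"\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
}

for label, raw in samples.items():
    print(f"{label}: {'accepted' if is_valid_utf8(raw) else 'rejected'}")
```

    A decoder that accepts any of the last four inputs has the sort of
    bug a stress test is there to catch.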

    > But given that the file is not conforming to UTF-8 because
    > of the "errors" it contains *on purpose*, no assumption should be made
    > about how the browser or text editor will behave with the content of
    > that file.

    Where is any such assumption being made? Actually, most of your
    statements on what is "wrong" with this file are based on the idea that
    it sets expectations about parser behaviour. However, paragraph 1
    explicitly rules this out. So what is the point?

    > A conforming browser or editor should load that document without
    > encoding violation problems, assuming it is encoded instead with
    > ISO-8859-1 [...]

    While that may be technically correct behaviour, it would sort of
    defeat the purpose of testing a UTF-8 decoder, wouldn't it?

    > Nothing is wrong if lines are displayed with more or less characters, or
    > if "|" characters are not vertically aligned when using fixed fonts.

    Assuming, however, that the file is used for its purpose of testing a
    UTF-8 decoder, all lines should indeed align.

    >> You should see the Greek word 'kosme': "κόσμε"
    >> (...) [/quote]
    > You can see the Greek word here in this message (because this message is
    > properly UTF-8 encoded), but nothing is wrong in your editor or browser
    > if the word is not readable as indicated, and you see for example the
    > string "κόσμε" when your editor or browser loads the file as an
    > ISO-8859-1 text.

    Don't you think you are stretching things a bit? This is a UTF-8
    parser stress test file. If an application opens it in a different
    encoding, well, of course the results will be different, and things will
    not look UTF-8-ish. Again, this is a non-issue. It's like distributing
    a Linux binary for testing something and then getting complaints that it
    doesn't work under DOS and that it shouldn't make assumptions about
    operating systems.
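
    To make the difference concrete, here is a small sketch in Python 3
    that decodes the same bytes once as UTF-8 and once as ISO-8859-1.
    The word is spelled with explicit escapes, and the code point used
    for the accented omicron (U+03CC) is an assumption; the variable
    names are likewise just illustrative.

```python
# The Greek word, written with escapes so the source encoding is irrelevant.
word = "\u03ba\u03cc\u03c3\u03bc\u03b5"
raw = word.encode("utf-8")  # five characters, two bytes each

# Decoded with the encoding it was written in, the word round-trips.
as_utf8 = raw.decode("utf-8")

# ISO-8859-1 maps every byte to exactly one character, so this cannot
# fail; it just yields one unrelated character per byte.
as_latin1 = raw.decode("iso-8859-1")

# For the same reason, malformed UTF-8 also "loads fine" as ISO-8859-1,
# which is why that path tests nothing about a UTF-8 decoder.
malformed = b"\xc0\xaf"
latin1_of_malformed = malformed.decode("iso-8859-1")

print(as_utf8 == word)
print(len(as_utf8), len(as_latin1))
```

    Neither result says anything about a bug: the ISO-8859-1 path simply
    never runs the UTF-8 decoder at all.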

    And so on.

