Re: UTF-8 stress test file?

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Oct 12 2004 - 22:34:49 CST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Have you read the file content? It clearly and explicitly speaks about
    > UTF-16, which has no place in a text file that is supposed to be
    > UTF-8, unless the file was used as a test for CESU-8 (which is not
    > UTF-16 either, and not even UTF-8).

    It includes surrogate code points in the range U+D800 through U+DFFF.
    Surrogates are a UTF-16 concept, but these are encoded as (malformed)
    UTF-8 sequences. The file mentions UTF-16 only in the context of these
    surrogate code points.

    Saying "the file mixes UTF-8 and UTF-16" implies that Markus was
    confused or made some sort of mistake, when in fact this test was very
    intentional. The point of including them was to test whether a UTF-8
    decoder would interpret them (it should not). It's actually a great
    test; most decoders do not pass it.
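    This can be checked mechanically. As a sketch (Python's built-in codec
    standing in for "a UTF-8 decoder" -- an assumption, not something from
    the original thread), a conformant decoder must refuse the three-byte
    sequence ED A0 80, which would otherwise yield the surrogate U+D800:

    ```python
    # The stress file encodes surrogate code points such as U+D800 as the
    # byte sequence ED A0 80; a conformant UTF-8 decoder must reject it.
    surrogate = b"\xed\xa0\x80"

    try:
        surrogate.decode("utf-8")
        print("accepted surrogate: decoder fails the test")
    except UnicodeDecodeError:
        print("rejected surrogate: decoder passes the test")
    ```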

    > My statement was correct: it is based on the fact that the test file
    > was created for the older (RFC version) of UTF-8 used in old versions
    > of ISO 10646, and never referenced (at least explicitly until the
    > v4.01 clarification) by Unicode in any version.

    Indeed, the file does assume that Unicode scalar values above U+10FFFF
    are valid, which is no longer true. But the statement "the file mixes
    UTF-8 and UTF-16" is not correct.
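    For illustration (again a sketch using Python's codec, not anything
    from the original post): the old RFC 2279 definition allowed five- and
    six-byte sequences for scalar values up to U+7FFFFFFF, which modern
    UTF-8 forbids:

    ```python
    # F8 88 80 80 80 is an old-style five-byte encoding of U+200000,
    # legal under RFC 2279 but above the modern U+10FFFF ceiling.
    old_style = b"\xf8\x88\x80\x80\x80"

    try:
        old_style.decode("utf-8")
        print("accepted (pre-2003 behaviour)")
    except UnicodeDecodeError:
        print("rejected (modern UTF-8)")
    ```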

    Later:

    > - [quote](...) Experience so far suggests
    > that most first-time authors of UTF-8 decoders find at least one
    > serious problem in their decoder by using this file.[/quote]
    >
    > This suggests to the reader that if their browser or editor does not
    > display the contained test text as indicated, there's a problem in
    > that application.

    There certainly is.

    > But given that the file is not conforming to UTF-8 because of the
    > "errors" it contains *on purpose*, no assumption should be made about
    > how the browser or text editor will behave with the content of that
    > file. Any difference with what is "expected" by the text is really not
    > a bug, given that the whole file is incorrect and is *not* UTF-8
    > encoded.

    It is UTF-8–encoded, but with errors.

    > In fact, if your browser or editor still allows you to view it as if
    > it was UTF-8, and indicates to the user that it is UTF-8 encoded without
    > warning the user about the encoding violations that should be
    > detected, I really think that this browser or editor is not
    > conforming. A conforming browser or editor should load that document
    > without encoding violation problems, assuming it is encoded instead
    > with ISO-8859-1 or ISO-8859-2 or any other complete 8-bit encoding (an
    > encoding that has no invalid code position, so ISO-8859-4 should not
    > work without similar warnings).

    The whole point of this test is that (1) the file is encoded in UTF-8,
    but wait, (2) some of the sequences are invalid. You can assume that
    the file is not UTF-8 if you like, but that destroys the whole premise
    of the test. It is as if you had a PNG image file that contained an
    invalid chunk length or CRC or something, and concluded that the file
    was not a PNG image file after all, but some other kind of file.

    Re-interpreting the file as ISO 8859-1 or some other encoding is one way
    of handling the problem of invalid sequences, but any application that
    does this and still claims to be Unicode-conformant had better let me
    know it is going to do such a thing, and should give me a chance to turn
    the "feature" off (and should also let me choose the fallback encoding;
    not everybody wants ISO 8859-1).
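    A sketch of the behaviour being asked for (the function name and the
    `fallback` parameter are hypothetical, not any real browser API): try
    strict UTF-8 first, and only reinterpret with a user-chosen 8-bit
    encoding after an explicit warning:

    ```python
    import sys

    def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1") -> str:
        """Strict UTF-8 first; fall back to a *user-chosen* 8-bit encoding."""
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # The warning asked for above: never silently switch encodings.
            print(f"warning: not valid UTF-8, reinterpreting as {fallback}",
                  file=sys.stderr)
            return data.decode(fallback)
    ```

    For example, `decode_with_fallback(b"\xed\xa0\x80")` warns and then
    yields the three ISO 8859-1 characters U+00ED, U+00A0, U+0080 instead
    of raising.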

    > The only thing that could be said is that the document respects only
    > the ISO 10646-1:2000 standard, but not its later version and not
    > Unicode (so a browser or editor could still accept the document as
    > being encoded with UTF-8:2000, but not with UTF-8).

    The file contains all sorts of invalid UTF-8 sequences, including many
    that were never valid under any specification of UTF-8.
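    A few examples from that category, sketched with Python's codec as the
    stand-in decoder: stray continuation bytes, and the bytes FE and FF,
    have never been legal anywhere in a UTF-8 stream:

    ```python
    # None of these byte sequences was ever well-formed UTF-8, under
    # RFC 2279, RFC 3629, or any version of the Unicode Standard.
    never_valid = [b"\x80", b"\xbf", b"\xfe", b"\xff"]

    for seq in never_valid:
        try:
            seq.decode("utf-8")
            print(seq, "accepted (broken decoder)")
        except UnicodeDecodeError:
            print(seq, "rejected")
    ```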

    > - [quote](...) All lines in this file are exactly 79 characters long
    > (plus the line feed). In addition, all lines end with "|", except for
    > the two test lines 2.1.1 and 2.2.1, which contain non-printable ASCII
    > controls U+0000 and U+007F. If you display this file with a fixed-
    > width font, these "|" characters should all line up in column 79
    > (right margin).[/quote]
    >
    > Nothing is wrong if lines are displayed with more or fewer
    > characters, or if "|" characters are not vertically aligned when
    > using fixed fonts.

    I agree here. The decoder gets to decide how to render the invalid
    sequences, as long as it doesn't act as though they are valid. See the
    next paragraph, which explains this.
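    One common, conformant way to render such a file (sketched here with
    Python's `errors="replace"` handler; substituting U+FFFD is the usual
    convention, not a requirement stated in the post) is to show a
    replacement character for each invalid sequence:

    ```python
    # Valid sequences decode normally; invalid ones become U+FFFD, so the
    # decoder never acts as though the malformed bytes were valid text.
    sample = b"ok: \xce\xba | bad: \xc0\xaf"
    rendered = sample.decode("utf-8", errors="replace")
    print(rendered)
    ```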

    > You should see the Greek word 'kosme': "κόσμε"
    > (...) [/quote]
    >
    > You can see the Greek word here in this message (because this message
    > is properly UTF-8 encoded), but nothing is wrong in your editor or
    > browser if the word is not readable as indicated, and you see for
    > example the string "ÎºÏŒÏƒÎ¼Îµ" when your editor or browser loads the
    > file as an ISO-8859-1 text.

    Treating the file as something other than UTF-8 misses the point
    entirely.

    > - All the section 3 "Malformed sequences" should not be readable at
    > all, or could display random characters when the text is loaded as
    > ISO-8859-1.

    Treating the file as something other than UTF-8 misses the point
    entirely.

    > - Same thing for section 4 "Overlong sequences" (prohibited in UTF-8,
    > but tolerated in UTF-8:2000 i.e. the RFC version used by ISO
    > 10646:2000). If you see "?" characters without other warnings, your
    > browser is not conforming, exactly like browsers that would display
    > the indicated slash "/".

    I agree.
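    A sketch of the section 4 case (Python's codec standing in for the
    browser's decoder): C0 AF is a two-byte, overlong encoding of the
    slash, and the shortest-form rule means a conformant decoder must
    never produce "/" from it:

    ```python
    overlong_slash = b"\xc0\xaf"  # non-shortest-form encoding of U+002F "/"

    try:
        result = overlong_slash.decode("utf-8")
    except UnicodeDecodeError:
        result = None

    # A decoder that yields "/" here is the security hole the shortest-form
    # rule exists to close (e.g. "../" path-check bypasses).
    assert result != "/"
    ```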

    > - Section 5 "Illegal code positions" (single and paired "UTF-16"
    > surrogates) is the one that should immediately throw an exception in
    > the browser's UTF-8 decoder to force it to retry with another encoding
    > (possibly with UTF-8:2000, or with ISO-8859-1). Nothing is wrong in
    > your browser if you see sequences like "í €" or "í¿¿" when the file is
    > loaded as Windows-1252, or if lines do not line up or have strange
    > layout when the file is loaded as ISO-8859-1.

    Treating the file as something other than UTF-8 misses the point
    entirely.

    > - Subsection 5.3 "Other illegal code positions" also forgets all
    > illegal *code points* (not "code positions" !) that are permanently
    > reserved in the 16 other planes (out of the BMP), and illegal
    > positions found in the Arabic compatibility block.

    Perhaps surprisingly, noncharacters are not invalid sequences. UTF-8
    and other Unicode encoding forms and schemes are required to pass them
    unaltered, even though they are not supposed to be interpreted as
    characters (i.e. assigned to glyphs). But you're right, the test file
    does miss these cases.
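    A sketch of that distinction (Python's codec again): noncharacters are
    legal scalar values, so they round-trip through UTF-8, while
    surrogates cannot even be encoded:

    ```python
    # Noncharacters (U+FDD0..U+FDEF, and U+xxFFFE/U+xxFFFF in every plane)
    # are valid scalar values: encoders and decoders must pass them through.
    for noncharacter in ("\ufdd0", "\ufffe", "\U0010ffff"):
        assert noncharacter.encode("utf-8").decode("utf-8") == noncharacter

    # Surrogate code points, by contrast, are not scalar values at all.
    try:
        "\ud800".encode("utf-8")
        print("surrogate encoded (non-conformant)")
    except UnicodeEncodeError:
        print("surrogates rejected; noncharacters round-trip")
    ```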

    > So who's puzzling here? Not me! It's the content of the text itself.

    I think there is still some confusion over the intent of the test, which
    is to assume the file is UTF-8 and then identify the invalid UTF-8
    sequences. Re-interpreting the file as ISO 8859-1 or as a less
    restrictive form of UTF-8 defeats the purpose of having a test.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Tue Oct 12 2004 - 22:37:57 CST