Re: UTF-8 stress test file?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 12 2004 - 19:22:53 CST


    From: "Philipp Reichmuth" <reichmuth@web.de>
    > Don't you think you are stretching things a bit? This is an UTF-8 parser
    > stress test file. If an application opens it in a different encoding,
    > well, of course the results will be different, and things will not look
    > UTF-8-ish. Again, this is a non-issue. It's like distributing a Linux
    > binary for testing something and then getting complaints that it doesn't
    > work under DOS and that it shouldn't make assumptions on operating
    > systems.

    That's not the point I wanted to focus on. Things CANNOT look "UTF-8-ish"
    in a UTF-8 conforming editor or browser, which will correctly detect all
    encoding errors in that file and thus will never present the text
    properly aligned. What a conforming editor or browser *may* do is
    recover, in which case it must signal to the user the positions of the errors
    (possibly by using a replacement glyph, as if each error encoded a U+FFFD
    substitute). But how many errors will you signal, given that the error
    recovery level is not defined in the Unicode/ISO/IEC UTF-8 standard?
    Even in the old ISO/IEC 10646 standard, recovery after errors is possible
    only if uninterpretable byte sequences are still properly parsed
    into sub-sequences (of unspecified length) for which a substitute can be used.
    The problem is the length of each invalid byte sequence. For example, if
    there's a 4-byte old UTF-8 encoding sequence (or longer), the error will be
    detected at the first byte. Recovery will take place at the second byte,
    after the first byte has been interpreted as an invalid sequence represented
    by a substitute glyph, but then each of the immediately following trailing
    bytes will signal an error of its own.
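
To illustrate the per-byte behaviour described above, here is a minimal sketch using CPython 3's built-in UTF-8 codec. It assumes the 5-byte form F8 88 80 80 80 as a sample of the "(or longer)" old-style sequences; the variable names are illustrative only, and other decoders may report a different number of errors for the same input.

```python
# F8 88 80 80 80 is a 5-byte form from the original UTF-8 definition;
# it is not valid in current UTF-8 (sample input chosen for this sketch).
data = b"A\xf8\x88\x80\x80\x80B"

# errors="replace" substitutes U+FFFD for every decoding error the codec reports.
decoded = data.decode("utf-8", errors="replace")

# Count how many substitute characters this particular decoder produced.
replacement_count = decoded.count("\ufffd")
print(replacement_count)  # CPython 3 reports one error per byte here: 5
```

The lead byte and each of the four trailing bytes are reported separately, so a single malformed sequence shows up as five substitute glyphs.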

    Suppose that the parser recovers by scanning until it finds a new starter
    byte: it will still need to parse this byte to see if it's a leading byte
    for a longer sequence, so recovery is not necessarily possible immediately
    after the first invalid byte, or after the supposed end of the byte
    sequence. Now if the parser recovers by skipping all bytes until a valid
    sequence is found, there will be only 1 encoding error thrown, on the
    leading byte, and only 1 substitution glyph.
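
The skip-until-valid strategy can be sketched as follows. This is only one possible reading of the paragraph above, not a behaviour defined by any standard; the helper names `valid_sequence_length` and `decode_skipping_runs` are hypothetical, and validity is checked by delegating to Python's strict UTF-8 codec.

```python
REPLACEMENT = "\ufffd"

def valid_sequence_length(data: bytes, i: int) -> int:
    """Return the length of the valid UTF-8 sequence starting at i, or 0."""
    for n in range(1, 5):  # valid UTF-8 sequences are 1 to 4 bytes long
        chunk = data[i:i + n]
        try:
            # Accept only a chunk that decodes to exactly one character.
            if len(chunk.decode("utf-8")) == 1:
                return len(chunk)
        except UnicodeDecodeError:
            continue
    return 0

def decode_skipping_runs(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per run of undecodable bytes."""
    out = []
    i = 0
    while i < len(data):
        n = valid_sequence_length(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            continue
        # One substitute for the whole run, then skip until a position
        # where a complete valid sequence starts (or the input ends).
        out.append(REPLACEMENT)
        i += 1
        while i < len(data) and valid_sequence_length(data, i) == 0:
            i += 1
    return "".join(out)

result = decode_skipping_runs(b"A\xf8\x88\x80\x80\x80B")
print(result.count(REPLACEMENT))  # 1
```

Note that the resynchronisation loop has to test whether a full valid sequence starts at each candidate position, not merely whether the byte looks like a starter, which is the complication raised in the paragraph above.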

    We are navigating in unspecified territory: error recovery after
    decoding errors is not defined in the current UTF-8 standard itself (not
    even in the old RFC version with ISO/IEC 10646-1:2000).

    And as I said, the document itself is not complete enough, because it
    forgets other invalid sequences for non-characters.


