Re: UTF-8 stress test file?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 12 2004 - 19:22:53 CST

Next message: Doug Ewell: "Re: UTF-8 stress test file?"

Previous message: James Kass: "Re: bit notation in ISO-8859-x is wrong"
In reply to: Philipp Reichmuth: "Re: UTF-8 stress test file?"
Next in thread: D. Starner: "Re: UTF-8 stress test file?"

From: "Philipp Reichmuth" <reichmuth@web.de>
> Don't you think you are stretching things a bit? This is an UTF-8 parser
> stress test file. If an application opens it in a different encoding,
> well, of course the results will be different, and things will not look
> UTF-8-ish. Again, this is a non-issue. It's like distributing a Linux
> binary for testing something and then getting complaints that it doesn't
> work under DOS and that it shouldn't make assumptions on operating
> systems.

That's not the point I wanted to focus on. Things CANNOT look "UTF-8-ish"
in a UTF-8 conforming editor or browser, because it will correctly detect
all the encoding errors in that file and thus will never present the text
properly aligned. What a conforming editor or browser *may* do is recover,
in which case it must signal the positions of the errors to the user
(possibly by using a replacement glyph, as if each error encoded a U+FFFD
substitute). But how many errors will it signal, given that the error
recovery level is not defined in the Unicode/ISO/IEC UTF-8 standard?
Even in the old ISO/IEC 10646 standard, recovery after errors is only
possible if uninterpretable byte sequences are still properly parsed into
sub-sequences (of unspecified length) for which a substitute can be used.

The problem is the length of each invalid byte sequence. For example, if
there's a 4-byte (or longer) sequence from the old UTF-8 encoding, the
error will be detected at the first byte. That first byte is interpreted as
an invalid sequence and represented by a substitute glyph, and recovery
takes place at the second byte, but then each of the immediately following
trailing bytes will signal an error of its own.
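
To illustrate the per-byte behaviour described above, here is a small
Python sketch. It only shows what one particular decoder (CPython's
built-in "utf-8" codec with errors="replace") happens to do; it is one
possible policy, not something the standard mandates. The sample bytes are
an assumption chosen for the illustration: 'A', an old-style 4-byte
sequence starting with the lead byte 0xF5, then 'B'.

```python
# Hypothetical sample input: 'A', then an old-style 4-byte sequence whose
# lead byte 0xF5 is no longer valid in UTF-8, then 'B'.
data = b"A\xf5\x80\x80\x80B"

# errors="replace" substitutes U+FFFD for each undecodable piece of input.
text = data.decode("utf-8", errors="replace")

# Count how many substitute characters this decoder emitted.
replacement_count = text.count("\ufffd")

print(repr(text), replacement_count)
```

With this decoder the lead byte and each stray trailing byte are reported
separately, so one ill-formed sequence produces several substitutes.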

Suppose instead that the parser recovers only when it finds a new starter
byte. It will still need to parse this byte to see if it's a leading byte
for a longer sequence, so recovery is not necessarily possible immediately
after the first invalid byte, or after the supposed end of the byte
sequence. Now if the parser recovers by skipping all bytes until a valid
sequence is found, only 1 encoding error will be thrown, on the leading
byte, and only 1 substitution glyph will be shown.
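
The skip-until-valid policy can be sketched in Python as follows. This is
only an illustration under stated assumptions: the function names
valid_sequence_length and decode_skip_to_resync are hypothetical, the
validity check simply delegates to Python's strict "utf-8" codec, and no
standard prescribes this behaviour.

```python
def valid_sequence_length(data: bytes, pos: int) -> int:
    """Return the length of the valid UTF-8 sequence at pos, or 0 if none."""
    for length in range(1, 5):
        chunk = data[pos:pos + length]
        if len(chunk) < length:
            break  # ran off the end of the input
        try:
            chunk.decode("utf-8")  # strict check of this candidate
        except UnicodeDecodeError:
            continue  # incomplete or ill-formed, try a longer candidate
        return length
    return 0


def decode_skip_to_resync(data: bytes) -> str:
    """Emit one U+FFFD per error, then skip to the next valid sequence."""
    out = []
    i = 0
    while i < len(data):
        length = valid_sequence_length(data, i)
        if length:
            out.append(data[i:i + length].decode("utf-8"))
            i += length
            continue
        out.append("\ufffd")  # a single substitute for the whole bad run
        i += 1
        while i < len(data) and not valid_sequence_length(data, i):
            i += 1  # skip bytes until a valid sequence begins
    return "".join(out)


# Hypothetical sample: 'A', old-style 4-byte sequence (lead 0xF5), 'B'.
result = decode_skip_to_resync(b"A\xf5\x80\x80\x80B")
print(repr(result), result.count("\ufffd"))
```

On the same kind of input, this policy reports one error where a per-byte
policy reports several, which is exactly the ambiguity at issue.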

We are navigating in unspecified territory: error recovery after decoding
errors is not defined in the current UTF-8 standard itself (not even in
the old RFC version with ISO/IEC 10646-1:2000).

And as I said, the document itself is not complete enough, because it
forgets other invalid sequences for non-characters.


This archive was generated by hypermail 2.1.5 : Tue Oct 12 2004 - 19:26:46 CST