Re: Best practices for replacing UTF-8 overlongs

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 20 Dec 2016 06:56:53 +0000

On Mon, 19 Dec 2016 20:54:31 -0700
Doug Ewell <doug_at_ewellic.org> wrote:

> There isn't much to be gained by collapsing the bad bytes to a single
> replacement character. However, doing so does remove the information
> about how many bytes were invalid and that may have value to a user
> in assessing how much of the document is suspect.

How many bytes are invalid in the sequence F0 30 A0 B0? There might
just be one bit error in the data stream.
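
To make the question concrete, here is a small Python sketch. The candidate original F0 B0 A0 B0 is only an assumption chosen for illustration; nothing in the data says that was what was sent. It differs from F0 30 A0 B0 by a single bit, yet a decoder that replaces each ill-formed subsequence separately (as Python's built-in decoder does) reports three replacement characters.

```python
# The sequence from the question, and a hypothetical original that
# differs from it only in the top bit of the second byte (0x30 vs 0xB0).
garbled = bytes.fromhex("F030A0B0")
candidate = bytes.fromhex("F0B0A0B0")  # assumed for illustration only

# Number of bits that differ between the two sequences.
bits_flipped = sum(bin(a ^ b).count("1") for a, b in zip(garbled, candidate))

# The candidate is well-formed UTF-8 for one supplementary-plane code point.
decoded_candidate = candidate.decode("utf-8")

# Python's decoder emits U+FFFD for F0, keeps the ASCII byte 30 as "0",
# then emits U+FFFD for each of the stray continuation bytes A0 and B0.
decoded_garbled = garbled.decode("utf-8", errors="replace")
replacement_count = decoded_garbled.count("\ufffd")
```

So one damaged bit can show up as three "invalid" bytes, which is why a count of replacement characters is a shaky measure of how much of the document is suspect.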

The chief advantage of collapsing comes in the simplicity of the
decoding logic. The natural logic is to read the requisite number of
continuation bytes, converting the whole to a codepoint value, and then
check that the codepoint value is allowed in UTF-8 for a sequence of
that length (not overlong, not a surrogate, and no higher than
U+10FFFF). Obviously one also has to check that the requisite
continuation bytes are present.
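
The logic described above might be sketched in Python roughly as follows. This is one possible reading, not a reference implementation: the name decode_collapsing, the MIN_FOR_LENGTH table, and the choice to resume decoding at the first byte that is not a continuation byte are all assumptions made for the example.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs each sequence length;
# anything smaller in a sequence of that length is an overlong form.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}

def decode_collapsing(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per rejected sequence (hypothetical helper)."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Work out the sequence length and payload bits from the lead byte.
        if lead < 0x80:
            length, value = 1, lead
        elif 0xC0 <= lead <= 0xDF:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, value = 4, lead & 0x07
        else:
            # Stray continuation byte (80..BF) or a byte that is never a lead (F8..FF).
            out.append(REPLACEMENT)
            i += 1
            continue
        # Read the requisite continuation bytes, building the code point value.
        j = i + 1
        complete = True
        while j < i + length:
            if j >= n or (data[j] & 0xC0) != 0x80:
                complete = False  # a continuation byte is missing
                break
            value = (value << 6) | (data[j] & 0x3F)
            j += 1
        # Only now check whether the assembled value is allowed.
        allowed = (
            complete
            and value >= MIN_FOR_LENGTH[length]
            and value <= 0x10FFFF
            and not (0xD800 <= value <= 0xDFFF)
        )
        out.append(chr(value) if allowed else REPLACEMENT)
        i = j  # resume after the bytes consumed, at the offending byte if any
    return "".join(out)
```

With this sketch the overlong sequence C0 80 collapses to a single replacement character, as does the overlong E0 80 AF, because the validity check happens once, after the whole sequence has been read. F0 30 A0 B0 still yields three replacement characters around a literal "0", since there the continuation bytes are not where the lead byte promised them.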

Arguments then come down to the use or otherwise of library functions
and the number of error-reporting mechanisms to be used.

Richard.
Received on Tue Dec 20 2016 - 00:57:33 CST
