RE: Best practices for replacing UTF-8 overlongs

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Tue, 20 Dec 2016 04:35:51 +0000

So... an input data stream with corrupt UTF-8 basically has (under any scheme being discussed) some number of replacement characters.

Each of those replacement characters indicates at least one garbled byte, but without additional information, they aren't a great indicator of missing bytes.

I'm uncertain of what software would want to do with them at that point. Making assumptions about over-long byte sequences being intelligible seems like it would require deep knowledge of how UTF-8 works, in which case why bother calling a generic UTF-8 decoding API?

"All bets are off" may not be very instructive, however I don't think the "one replacement character per bad byte" or "one replacement character for many bytes" improves that situation at all.

A too-long lead/trail byte could mean any of the following:

* Someone used a bad encoder.
* Someone was trying something malicious.
* Bits were flipped during transmission or storage.
* It is part of a sequence whose other bytes are missing.
* Confused software did a mixed-up copy-paste between applications.
* Bad buffering.
* Or....

Without additional information, I'm not sure what you expect the software to "know" beyond "this data stream is definitely not 100% perfect" (and conversely, you wouldn't necessarily know that an apparently valid data stream had not itself been corrupted).

The options at that point would seem to be:

* Just keep on going, maybe some user'll fix it later.
* Warn the user somehow (though we're not going to be able to tell them much beyond "corrupted file")
* Reject the data as corrupted and refuse to load it.
* Attempt some sort of repair - that seems unlikely for most applications unless they have unique knowledge of possible corruption modes and have some sort of redundancy built in.

-Shawn

PS: The one thing I *do* know, from a program I once had to debug, is that it is unwise to decode a binary executable as UTF-8 (producing a large number of replacement characters), take the hash of the resulting stream, and then test that hash to ensure the binary had not been tampered with. That program could have been rewritten to do almost anything without touching the hash!
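
For illustration, here is a small Python sketch of that failure mode (the byte values and the choice of SHA-256 are made up for the example, not taken from the actual program): once the illegal bytes have been substituted away, two different streams can decode to the same text, so a hash computed after decoding cannot see the difference.

    import hashlib

    # Two different raw byte streams; neither is valid UTF-8, and the bytes
    # here are made up purely for illustration.
    original = b'MZ\x90\x00' + b'\xc0\x80' + b'payload'
    tampered = b'MZ\x90\x00' + b'\xff\xfe' + b'payload'

    # Decoding with substitution maps every illegal byte to U+FFFD, so the
    # two streams collapse to the identical string...
    text_a = original.decode('utf-8', errors='replace')
    text_b = tampered.decode('utf-8', errors='replace')
    print(text_a == text_b)                                   # True

    # ...and a hash taken over the decoded text (rather than the raw bytes)
    # can no longer tell the two "binaries" apart.
    print(hashlib.sha256(text_a.encode()).hexdigest() ==
          hashlib.sha256(text_b.encode()).hexdigest())        # True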

-----Original Message-----
From: Tex Texin [mailto:textexin_at_xencraft.com]
Sent: Monday, December 19, 2016 6:36 PM
To: Shawn Steele <Shawn.Steele_at_microsoft.com>; 'Doug Ewell' <doug_at_ewellic.org>; 'Unicode Mailing List' <unicode_at_unicode.org>
Cc: 'Karl Williamson' <public_at_khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs

Shawn,

Ok, but that begs the question of what to do...
"All bets are off" is not instructive.

How software behaves in the face of invalid bytes, what it does with them, what it does about them, and how it continues (or not) still needs to be determined.

tex

-----Original Message-----
From: Shawn Steele [mailto:Shawn.Steele_at_microsoft.com]
Sent: Monday, December 19, 2016 5:41 PM
To: Tex Texin; 'Doug Ewell'; 'Unicode Mailing List'
Cc: 'Karl Williamson'
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, bad bytes == corruption. At that point all bets are off because the machine has no clue "how" it was corrupted. It could just be a single bit flipped in transmission. It could have been an attack hack using overlong byte sequences. It could be an entire lost packet/block/sector.

-Shawn

-----Original Message-----
From: Tex Texin [mailto:textexin_at_xencraft.com]
Sent: Monday, December 19, 2016 5:23 PM
To: Shawn Steele <Shawn.Steele_at_microsoft.com>; 'Doug Ewell' <doug_at_ewellic.org>; 'Unicode Mailing List' <unicode_at_unicode.org>
Cc: 'Karl Williamson' <public_at_khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs

If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes into a single replacement character; doing so also removes the information about how many bytes were invalid, and that count may have value to a user in assessing how much of the document is suspect.
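
For example, a quick way to surface that signal (just a sketch: the count approximates the number of bad bytes only under a one-replacement-per-bad-byte policy, and it will also pick up any U+FFFD already present in the data) is to decode with substitution and count the replacement characters:

    def suspect_ratio(data: bytes) -> float:
        # Decode with substitution and count U+FFFD. Under a one-replacement-
        # per-bad-byte policy the count approximates the number of illegal
        # bytes; under a one-per-sequence policy it only counts garbled runs.
        text = data.decode('utf-8', errors='replace')
        return text.count('\ufffd') / max(len(text), 1)

    print(suspect_ratio(b'abc\xc0\x80def'))   # 2 of 8 characters -> 0.25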

tex

-----Original Message-----
From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the two-byte sequence is illegal. So replace it with a single replacement character (hey, I ran into something unintelligible) and move on. Then you encounter a trail byte without a lead byte, so again it's an illegal byte; replace it with a single replacement character.

So you end up with two.

Replacing the whole sequence with a single replacement character implies some perceived understanding of an intended structure that doesn't actually exist.

I'm curious, though, what the practical difference would be. If I encounter junk like that in the middle of a string, the string is going to be disrupted by an unexpected replacement character. At that point it's already mangled, so does it really matter if there are two instead of one?
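
For concreteness, here is a minimal Python sketch of that "replace the illegal byte and move on" policy (the function name is mine, and real decoders differ on details such as how they report truncated sequences at the end of a buffer):

    REPLACEMENT = '\ufffd'

    def decode_replace_per_bad_byte(data: bytes) -> str:
        # Emit one U+FFFD per illegal byte and resync on the very next byte.
        out, i, n = [], 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                               # ASCII passes through
                out.append(chr(b)); i += 1; continue
            if 0xC2 <= b <= 0xDF:   need, cp, lo = 1, b & 0x1F, 0x80
            elif 0xE0 <= b <= 0xEF: need, cp, lo = 2, b & 0x0F, 0x800
            elif 0xF0 <= b <= 0xF4: need, cp, lo = 3, b & 0x07, 0x10000
            else:                                      # C0/C1, F5..FF, or a lone trail byte
                out.append(REPLACEMENT); i += 1; continue
            tail = data[i + 1:i + 1 + need]
            if len(tail) < need or any(not 0x80 <= t <= 0xBF for t in tail):
                out.append(REPLACEMENT); i += 1; continue   # bad/missing trail: replace the lead, resync
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            if cp < lo or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                out.append(REPLACEMENT); i += 1; continue   # overlong, surrogate, or out of range
            out.append(chr(cp)); i += need + 1
        return ''.join(out)

    print(decode_replace_per_bad_byte(b'\xc0\x80'))    # two replacement characters, as described above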

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode_at_unicode.org>
Cc: Karl Williamson <public_at_khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences.
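
As a concrete illustration (Python is used here only to show the arithmetic; any conformant decoder behaves the same way): '/' (U+002F) has exactly one legal UTF-8 encoding, the single byte 2F, while the overlong two-byte form C0 AF carries the same code point value and must be rejected rather than decoded.

    # U+002F '/' : the only legal UTF-8 encoding is the single byte 0x2F.
    assert '/'.encode('utf-8') == b'\x2f'

    # The overlong two-byte form C0 AF (110 00000  10 101111) carries the
    # same payload bits, 0x2F, but a conformant decoder rejects it rather
    # than treating it as '/' -- the classic filter-bypass trick.
    try:
        b'\xc0\xaf'.decode('utf-8')                      # strict mode raises
    except UnicodeDecodeError as exc:
        print(exc.reason)                                # e.g. "invalid start byte"
    print(b'\xc0\xaf'.decode('utf-8', errors='replace')) # replacement characters, never '/'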
 

--
Doug Ewell | Thornton, CO, US | ewellic.org