RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Shawn Steele via Unicode on 2017-05-16 (Unicode Mail List Archive)

From: Shawn Steele via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 17:30:01 +0000

> Would you advocate replacing

> e0 80 80

> with

> U+FFFD U+FFFD U+FFFD (1)

> rather than

> U+FFFD (2)

> It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t
> want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t
> see the logic in insisting that it must be decoded to *three* code points when it clearly only
> represented one in the input.

It is not at all clear what the intent of the encoder was, or whether this is simply a problem with the data stream. E0 80 80 is not permitted; it's garbage. An encoder can't "intend" it.
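For illustration, here is a minimal Python sketch of why E0 80 80 is ill-formed and what one existing decoder does with it. The helper name `is_well_formed_e0_sequence` is hypothetical; its byte ranges follow the E0 row of Table 3-7 (Well-Formed UTF-8 Byte Sequences) in the Unicode Standard. The decode call uses CPython's built-in `bytes.decode` with `errors="replace"`; other decoders may produce a different number of U+FFFD, which is exactly what the thread is arguing about.

```python
def is_well_formed_e0_sequence(seq: bytes) -> bool:
    """Hypothetical helper: check a 3-byte sequence whose lead byte is E0.

    After E0 the second byte must be in A0..BF (not 80..BF), which is
    what rules out overlong forms such as E0 80 80.
    """
    return (
        len(seq) == 3
        and seq[0] == 0xE0
        and 0xA0 <= seq[1] <= 0xBF
        and 0x80 <= seq[2] <= 0xBF
    )


ill_formed = bytes([0xE0, 0x80, 0x80])
print(is_well_formed_e0_sequence(ill_formed))  # the sequence is rejected

# CPython's decoder substitutes U+FFFD for each ill-formed piece it finds.
decoded = ill_formed.decode("utf-8", errors="replace")
print([f"U+{ord(ch):04X}" for ch in decoded])
```

Either count keeps the decoder from producing a NUL; the disagreement is only about how many replacement characters appear in the output.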

Either
A) the "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, so the number of FFFDs doesn't matter, or

B) the "encoder" is completely broken, in which case all bets are off and, again, specifying the number of FFFDs is irrelevant, or

C) the data was corrupted by some other means: perhaps bad concatenations, lost blocks during read/transmission, etc. If we lost two 512-byte blocks, then maybe we should have a thousand FFFDs (but how would we know?).
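Case C can be sketched in Python under some made-up assumptions: a hypothetical stream of 1000 euro signs (U+20AC, encoded as E2 82 AC) loses 1024 bytes starting at offset 1, so the cut lands in the middle of a character. The stream contents, the offset, and the variable names are all invented for the example; the decode call is CPython's built-in `bytes.decode` with `errors="replace"`.

```python
# Hypothetical stream: 1000 copies of U+20AC, each encoded as E2 82 AC.
original = ("\u20ac" * 1000).encode("utf-8")  # 3000 bytes

# Simulate losing two 512-byte blocks, starting one byte into the stream.
lost = 2 * 512
damaged = original[:1] + original[1 + lost:]

decoded = damaged.decode("utf-8", errors="replace")
replacements = decoded.count("\ufffd")  # U+FFFD emitted at the splice point
survivors = decoded.count("\u20ac")     # characters that still decode cleanly

print(f"bytes lost: {lost}")
print(f"U+FFFD in output: {replacements}")
print(f"intact characters: {survivors}")
```

The decoder only sees the ragged edge where the two surviving pieces meet, so the number of U+FFFD it emits says nothing about how much data went missing.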

-Shawn
Received on Tue May 16 2017 - 12:31:03 CDT
