Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Asmus Freytag via Unicode
Date: Tue, 16 May 2017 11:13:53 -0700
On 5/16/2017 10:30 AM, Shawn Steele via Unicode wrote:
Would you advocate replacing

  e0 80 80

with

  U+FFFD U+FFFD U+FFFD     (1)

rather than

  U+FFFD                   (2)

It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t 
want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t
see the logic in insisting that it must be decoded to *three* code points when it clearly only 
represented one in the input.
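
For illustration, options (1) and (2) can be compared in Python. This is only a sketch under stated assumptions, not something proposed in the thread: `expected_length` and `decode_one_per_sequence` are hypothetical helper names, and "one U+FFFD per sequence" is interpreted here as "a lead byte plus however many continuation bytes it announces". Python's built-in decoder with `errors="replace"` is used as a stand-in for option (1).

```python
REPLACEMENT = "\ufffd"

def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte's bit pattern (0 if not a lead byte)."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray continuation byte (80..BF) or F8..FF

def decode_one_per_sequence(data: bytes) -> str:
    """Hypothetical policy (2): one U+FFFD per structurally complete but invalid sequence."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        # Collect the lead byte plus up to n-1 continuation bytes.
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))  # strict: rejects overlongs etc.
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

ill_formed = bytes([0xE0, 0x80, 0x80])
option_1 = ill_formed.decode("utf-8", errors="replace")  # built-in decoder
option_2 = decode_one_per_sequence(ill_formed)           # sketch above
print(len(option_1), len(option_2))
```

Neither result is NUL; the two differ only in how many replacement characters the receiver sees.
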
It is not at all clear what the intent of the encoder was - or even if it's not just a problem with the data stream.  E0 80 80 is not permitted, it's garbage.  An encoder can't "intend" it.

A) the "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, and so the # of FFFD's doesn't matter, or

B) the "encoder" is completely broken, in which case all bets are off and, again, specifying the # of FFFD's is irrelevant, or

C) the data was corrupted by some other means.  Perhaps bad concatenations, lost blocks during read/transmission, etc.  If we lost two 512-byte blocks, then maybe we should have a thousand FFFDs (but how would we know?)


Clearly, for the receiver, nothing reliable can be deduced about the raw byte stream once an FFFD has been inserted.

For the receiver, there's a fourth case that might have occurred:

D) the raw UTF-8 stream contained a valid U+FFFD
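
Case D can be shown with a couple of lines of Python. This is just a sketch using the built-in decoder; the variable names are illustrative. A well-formed EF BF BD and a replaced invalid byte produce the same code point, so the decoded text alone cannot tell the receiver which one occurred.

```python
# A well-formed UTF-8 encoding of U+FFFD itself.
valid = b"\xef\xbf\xbd".decode("utf-8")

# FF can never appear in well-formed UTF-8, so the decoder substitutes U+FFFD.
repaired = b"\xff".decode("utf-8", errors="replace")

# Both paths end in the same single code point.
print(valid == repaired)
```
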

Received on Tue May 16 2017 - 13:14:07 CDT
