Re: Best practices for replacing UTF-8 overlongs

From: Ken Whistler <>
Date: Tue, 20 Dec 2016 08:59:11 -0800


On 12/19/2016 6:08 PM, Doug Ewell wrote:
> I thought there was a corrigendum or other, comparatively recent
> addition to the Standard that spelled out how replacement characters
> are supposed to be substituted for invalid code unit sequences --
> something about detecting maximally long sequences. I'll look when I
> have a chance.

You found the resulting text in TUS 9.0, pp. 126-129. The origin of the
text there about best practices for using U+FFFD was the discussion and
resolution of PRI #121 in August 2008:

That was discussed at UTC #116. See the minutes:

There was feedback at the time advocating the 3rd option, rather than
the 2nd one that was eventually chosen by the UTC. See:

The actual text that resulted was first published in Unicode 5.2, p. 95:

Contrast that with the text in Unicode 5.0, which had no extended
discussion about handling conversion errors there. The Unicode 5.2 text
was later expanded with more definitions and explanation, to what you
see now in Unicode 9.0.
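
As an illustration only (this is not text from the Standard), the
practice described in TUS 9.0 of replacing each maximal subpart of an
ill-formed subsequence with a single U+FFFD can be sketched in Python.
The names decode_utf8_replace and _second_byte_range are hypothetical,
and the byte ranges are assumed from the table of well-formed UTF-8
byte sequences in Chapter 3. It is a minimal sketch, not a reference
implementation.

```python
REPLACEMENT = "\uFFFD"


def _second_byte_range(lead):
    """Allowed range of the byte right after a given lead byte.

    The special cases exclude overlongs (E0, F0), surrogates (ED),
    and values above U+10FFFF (F4).
    """
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF


def decode_utf8_replace(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal subpart of an
    ill-formed subsequence (hypothetical helper for illustration)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need, cp = 1, b & 0x1F
        elif 0xE0 <= b <= 0xEF:
            need, cp = 2, b & 0x0F
        elif 0xF0 <= b <= 0xF4:
            need, cp = 3, b & 0x07
        else:
            # C0, C1, F5..FF, or a lone trail byte: never valid here,
            # so the single byte is replaced on its own.
            out.append(REPLACEMENT)
            i += 1
            continue
        lo, hi = _second_byte_range(b)
        j = i + 1
        ok = True
        for _ in range(need):
            if j < n and lo <= data[j] <= hi:
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
                lo, hi = 0x80, 0xBF       # later trail bytes: full range
            else:
                ok = False                # stop at the first bad byte
                break
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j                             # resume at the offending byte
    return "".join(out)
```

Under this sketch, the overlong pair C0 AF becomes two U+FFFD and the
overlong triple E0 80 AF becomes three, because in both cases the lead
byte cannot be followed by the byte that actually comes next, so every
byte ends up being replaced on its own.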

