Re: Best practices for replacing UTF-8 overlongs

From: Ken Whistler <kenwhistler_at_att.net>
Date: Tue, 20 Dec 2016 08:59:11 -0800

Doug,

On 12/19/2016 6:08 PM, Doug Ewell wrote:
> I thought there was a corrigendum or other, comparatively recent
> addition to the Standard that spelled out how replacement characters
> are supposed to be substituted for invalid code unit sequences --
> something about detecting maximally long sequences. I'll look when I
> have a chance.
>
You found the resulting text in TUS 9.0, pp. 126-129. The origin of the
text there about best practices for using U+FFFD was the discussion and
resolution of PRI #121 in August 2008:

http://www.unicode.org/review/pr-121.html

That was discussed at UTC #116. See the minutes:

http://www.unicode.org/L2/L2008/08253.htm

There was feedback at the time advocating the 3rd option in the PRI,
rather than the 2nd one that was eventually chosen by the UTC. See:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

The actual text that resulted was first published in Unicode 5.2, p. 95:

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

Contrast that with the text in Unicode 5.0, which had no extended
discussion about handling conversion errors there. The Unicode 5.2 text
was later expanded with more definitions and explanation, to what you
see now in Unicode 9.0.
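
For anyone who wants to see the "maximally long sequence" idea in concrete
form, here is a minimal Python sketch of the practice as I read the text in
the Standard: each maximal subpart of an ill-formed subsequence is replaced
by a single U+FFFD, and decoding resumes at the first byte that could not
continue the sequence. The function names are hypothetical, chosen only for
this illustration, and the byte ranges follow the well-formed UTF-8 table in
Chapter 3. Treat it as a sketch under those assumptions, not as a reference
implementation.

```python
REPLACEMENT = "\ufffd"


def _trail_spec(lead):
    """Return (trailing_count, lo, hi) for a lead byte, where lo..hi is the
    allowed range of the SECOND byte, or None if the byte can never start a
    well-formed sequence (C0, C1, F5..FF, or a lone continuation byte)."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF   # excludes overlong 3-byte forms
    if lead == 0xED:
        return 2, 0x80, 0x9F   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xF0:
        return 3, 0x90, 0xBF   # excludes overlong 4-byte forms
    if lead == 0xF4:
        return 3, 0x80, 0x8F   # excludes values above U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    return None


def decode_utf8_replacing(data):
    """Decode bytes as UTF-8, emitting one U+FFFD per maximal subpart of
    each ill-formed subsequence (hypothetical helper for illustration)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        spec = _trail_spec(lead)
        if spec is None:
            # Byte cannot begin any well-formed sequence: replace it alone.
            out.append(REPLACEMENT)
            i += 1
            continue
        trailing, lo, hi = spec
        j = i + 1
        for k in range(trailing):
            if j >= n:
                break
            low, high = (lo, hi) if k == 0 else (0x80, 0xBF)
            if not (low <= data[j] <= high):
                break
            j += 1
        if j - i == trailing + 1:
            out.append(data[i:j].decode("utf-8"))  # complete, well-formed
        else:
            out.append(REPLACEMENT)                # truncated maximal subpart
        i = j  # resume at the byte that failed (or just past the sequence)
    return "".join(out)


# Sample byte sequence of the kind used in the Standard's discussion.
sample = bytes.fromhex("61F18080E180C2628063" "80BF64")
print([f"{ord(c):04X}" for c in decode_utf8_replacing(sample)])
```

With this approach the overlong forms C0 AF and E0 80 AF come out as two and
three U+FFFD respectively, since none of their bytes can combine into an
initial part of a well-formed sequence.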

--Ken
Received on Tue Dec 20 2016 - 11:00:01 CST
