Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: David Starner via Unicode
Date: Tue, 16 May 2017 09:29:09 +0000

On Tue, May 16, 2017 at 1:45 AM Alastair Houghton wrote:

> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD. There might seem to be *two* names that both contain
> U+FFFD in the same place. How do you distinguish between them?

If the database holds raw bytes, then the name is a byte string, not a
Unicode string, and can't contain U+FFFD at all. It's a relatively easy
rule to make and enforce that every string in a database is well-formed;
I would hope that most SQL servers do in fact reject malformed UTF-8
strings. On the other hand, I'd expect that an SQL server would accept
U+FFFD in a Unicode string.
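
To make the byte-string versus Unicode-string distinction concrete, here is a
minimal Python sketch of the kind of check described above. The function names
accept_name and is_well_formed are hypothetical, and this is not a claim about
how any particular SQL server behaves; it only shows that strict decoding
rejects ill-formed bytes while U+FFFD itself passes as ordinary text.

```python
def accept_name(raw: bytes) -> str:
    """Hypothetical ingestion check: return text only if raw is well-formed UTF-8."""
    # errors="strict" raises UnicodeDecodeError on any ill-formed sequence,
    # so no U+FFFD is ever manufactured by the decoder here.
    return raw.decode("utf-8", errors="strict")


def is_well_formed(raw: bytes) -> bool:
    """Report whether raw would be accepted by accept_name."""
    try:
        accept_name(raw)
    except UnicodeDecodeError:
        return False
    return True


# U+FFFD is an ordinary code point; its encoding EF BF BD is well-formed,
# so a Unicode string that really contains it is accepted.
replacement = accept_name(b"name\xef\xbf\xbd")

# An over-long form of "/" and a stray Latin-1 byte are both rejected.
overlong_ok = is_well_formed(b"\xc0\xaf")
latin1_ok = is_well_formed(b"caf\xe9")
```

With a rule like this at the boundary, two stored names can only share a
U+FFFD if both genuinely contain that code point.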

> I don’t see a problem; the point is that where a structurally valid UTF-8
> encoding has been used, albeit in an invalid manner (e.g. encoding a number
> that is not a valid code point, or encoding a valid code point as an
> over-long sequence), a single U+FFFD is appropriate. That seems a
> perfectly sensible rule to adopt.

It seems like a perfectly arbitrary rule to adopt; I'd like to assume that
the only sources of such UTF-8 data are willful attempts to break security,
and in that case, how is this a win? Non-attack sources of broken data are
much more likely to be the result of mixing UTF-8 with other character
encodings or raw binary data.
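
As an illustration of the two kinds of input being contrasted, the following
Python sketch decodes an over-long sequence and a piece of Latin-1 text
leniently and counts the U+FFFD characters produced. The helper names are
hypothetical, and the counts reflect CPython 3's built-in decoder only, not
the rule under discussion: that rule would yield a single U+FFFD for the
over-long sequence, whereas CPython reports one per rejected byte.

```python
def lenient(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for ill-formed input."""
    return raw.decode("utf-8", errors="replace")


def count_replacements(raw: bytes) -> int:
    """Count how many U+FFFD the lenient decoder produced for raw."""
    return lenient(raw).count("\ufffd")


# Structurally "valid-looking" but over-long two-byte form of "/" (U+002F).
overlong_slash = b"\xc0\xaf"

# Text that was encoded as Latin-1 and then mistaken for UTF-8.
latin1_text = "café".encode("latin-1")  # b"caf\xe9"

# CPython treats C0 as an invalid start byte and AF as a stray
# continuation byte, so each is replaced separately.
overlong_count = count_replacements(overlong_slash)

# The lone E9 is a truncated three-byte lead at end of input.
latin1_decoded = lenient(latin1_text)
```

Either way the result contains U+FFFD; the policies differ only in how many
are emitted for the over-long case.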

Received on Tue May 16 2017 - 04:29:54 CDT
