RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Shawn Steele via Unicode <unicode_at_unicode.org>
Date: Wed, 31 May 2017 19:28:03 +0000

> it’s more meaningful for whoever sees the output to see a single U+FFFD representing
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid lead byte and
> then another for an “unexpected” trailing byte.

I disagree. It may be more meaningful for some applications to have a single U+FFFD representing an illegally encoded 2-byte NULL (C0 80) than to have two U+FFFDs. But then you can't tell whether the input was an illegally encoded 2-byte NULL, an illegally encoded 3-byte NULL (E0 80 80), or something else, so information that other applications may be interested in is lost.

Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the byte, and try again" approach.
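
A rough sketch of that approach in Python, for illustration only. The function names are made up for this example, and leaning on Python's strict UTF-8 codec to decide what counts as a valid sequence is an assumption, not something taken from the proposal:

```python
REPLACEMENT = "\ufffd"


def _decode_one(data: bytes, i: int):
    """Return the single character that starts at data[i], or None if
    no well-formed UTF-8 sequence starts there (hypothetical helper)."""
    for length in (1, 2, 3, 4):
        chunk = data[i:i + length]
        try:
            # The strict codec rejects overlong forms, surrogates,
            # truncated sequences and stray continuation bytes.
            s = chunk.decode("utf-8")
        except UnicodeDecodeError:
            continue
        if len(s) == 1:
            return s
    return None


def decode_drop_one_byte(data: bytes) -> str:
    """Emit U+FFFD for an invalid sequence, drop one byte, try again."""
    out = []
    i = 0
    while i < len(data):
        ch = _decode_one(data, i)
        if ch is None:
            out.append(REPLACEMENT)
            i += 1  # drop only the offending byte, then resume
        else:
            out.append(ch)
            i += len(ch.encode("utf-8"))
    return "".join(out)


# Overlong 2-byte and 3-byte NULLs stay distinguishable by count.
print(len(decode_drop_one_byte(b"\xc0\x80")))      # 2
print(len(decode_drop_one_byte(b"\xe0\x80\x80")))  # 3
```

With this, every byte that can't start a valid sequence gets its own U+FFFD, so the number of replacement characters matches the number of bad bytes.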

-Shawn
Received on Wed May 31 2017 - 14:28:22 CDT
