Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Alastair Houghton via Unicode
Date: Wed, 17 May 2017 09:07:25 +0100

> On 16 May 2017, at 20:43, Richard Wordingham via Unicode wrote:
> On Tue, 16 May 2017 11:36:39 -0700
> Markus Scherer via Unicode wrote:
>> Why do we care how we carve up an illegal sequence into subsequences?
>> Only for debugging and visual inspection. Maybe some process is using
>> illegal, overlong sequences to encode something special (à la Java
>> string serialization, "modified UTF-8"), and for that it might be
>> convenient too to treat overlong sequences as single errors.
> I think that's not quite true. If we are moving back and forth through
> a buffer containing corrupt text, we need to make sure that moving three
> characters forward and then three characters back leaves us where we
> started. That requires internal consistency.

That’s very true. But the proposed change doesn’t actually affect that; it’s still the case that you can correctly identify boundaries in both directions.
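
To make that concrete, here is a minimal Python sketch (not taken from the proposal or from any real library). It assumes two hypothetical forward-stepping policies: next_maximal_subpart follows the well-formedness ranges of the Unicode Standard's Table 3-7 and treats each maximal subpart as one error, while next_structural is a simplified stand-in for the kind of policy being discussed, where the lead byte's bit pattern alone decides how many trail bytes may follow. The exact rules in the proposal may differ. The single prev_boundary function relies only on a property both policies share: a unit is either a lone trail byte, or a non-trail byte followed by at most three trail bytes.

```python
def is_trail(b):
    return 0x80 <= b <= 0xBF

def next_maximal_subpart(buf, i):
    """End of the unit starting at i; an ill-formed maximal subpart is one unit."""
    b = buf[i]
    if 0xC2 <= b <= 0xDF:
        ranges = [(0x80, 0xBF)]
    elif b == 0xE0:
        ranges = [(0xA0, 0xBF), (0x80, 0xBF)]
    elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
        ranges = [(0x80, 0xBF)] * 2
    elif b == 0xED:
        ranges = [(0x80, 0x9F), (0x80, 0xBF)]
    elif b == 0xF0:
        ranges = [(0x90, 0xBF)] + [(0x80, 0xBF)] * 2
    elif 0xF1 <= b <= 0xF3:
        ranges = [(0x80, 0xBF)] * 3
    elif b == 0xF4:
        ranges = [(0x80, 0x8F)] + [(0x80, 0xBF)] * 2
    else:
        ranges = []  # ASCII, stray trail byte, or invalid lead: one byte
    j = i + 1
    for lo, hi in ranges:
        if j < len(buf) and lo <= buf[j] <= hi:
            j += 1
        else:
            break
    return j

def next_structural(buf, i):
    """Assumed simplified policy: length comes from the lead byte's bit pattern."""
    b = buf[i]
    if 0xC0 <= b <= 0xDF:
        n = 2
    elif 0xE0 <= b <= 0xEF:
        n = 3
    elif 0xF0 <= b <= 0xF7:
        n = 4
    else:
        n = 1
    j = i + 1
    while j < i + n and j < len(buf) and is_trail(buf[j]):
        j += 1
    return j

def prev_boundary(buf, p, next_fn):
    """Boundary before p (p must be a boundary and p > 0), with bounded lookback."""
    s = p - 1
    lowest = max(p - 4, 0)
    # Skip back over trail bytes to the nearest candidate start.
    while s > lowest and is_trail(buf[s]):
        s -= 1
    # If the unit starting there ends exactly at p, that is the previous unit;
    # otherwise the byte just before p is a lone error.
    return s if next_fn(buf, s) == p else p - 1

def boundaries(buf, next_fn):
    out = [0]
    while out[-1] < len(buf):
        out.append(next_fn(buf, out[-1]))
    return out

sample = b"a\xc0\x80b\xed\xa0\x80\xe2\x82z"
for next_fn in (next_maximal_subpart, next_structural):
    fwd = boundaries(sample, next_fn)
    back = [len(sample)]
    while back[-1] > 0:
        back.append(prev_boundary(sample, back[-1], next_fn))
    assert back[::-1] == fwd  # walking backward retraces the forward walk
```

The two policies carve the sample up differently (C0 80 is two errors under the first and one under the second), but in each case stepping backward lands on exactly the boundaries that stepping forward produced.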

Kind regards,


Received on Wed May 17 2017 - 03:08:15 CDT
