Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Alastair Houghton via Unicode <>
Date: Thu, 18 May 2017 08:55:49 +0100

On 18 May 2017, at 06:01, Richard Wordingham via Unicode <> wrote:
> On Thu, 18 May 2017 02:04:55 +0200
> Philippe Verdy via Unicode <> wrote:
>> I find it intriguing that the update intends to enforce the decoding
>> of the **shortest** sequences, but now wants to treat **maximal
>> sequences** as a single unit with arbitrary length. UTF-8 was
>> designed to work only with some state machines that would NEVER need
>> to parse more than 4 bytes.
> If you look at the sample code in
>, you'll see that
> it's working with 6-byte sequences. It's the Unicode, as opposed to
> ISO 10646, version that has always been restricted to 4 bytes.

There are good reasons for restricting it to four-byte sequences, mind: doing so increases the number of code unit values that can never appear in valid UTF-8, which makes it easier to tell UTF-8 from not UTF-8. I don’t think anyone is proposing to allow 5-byte or 6-byte sequences.
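
To make the detection point concrete, here is a minimal Python sketch. The names NEVER_VALID, contains_never_valid_byte and looks_like_utf8 are hypothetical, made up for this illustration, and this is not the sample code referred to above. It assumes the four-byte Unicode definition: 0xC0 and 0xC1 could only start overlong two-byte forms, and 0xF5..0xFF would either lead past U+10FFFF or start a 5-byte or 6-byte form. Under the older 6-byte scheme, 0xF8..0xFD were legitimate lead bytes, so fewer values could be rejected on sight.

```python
# Byte values that can never occur in well-formed UTF-8 when sequences
# are limited to four bytes (assumption: the Unicode definition).
#   0xC0, 0xC1  : could only begin overlong two-byte forms
#   0xF5..0xFF  : would lead past U+10FFFF or begin 5-/6-byte forms
NEVER_VALID = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def contains_never_valid_byte(data: bytes) -> bool:
    """Cheap first check: a single such byte rules out UTF-8."""
    return any(b in NEVER_VALID for b in data)


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: reject on a never-valid byte, else try a strict decode."""
    if contains_never_valid_byte(data):
        return False
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
    except UnicodeDecodeError:
        return False
    return True
```

For example, looks_like_utf8(b"\xf8\x88\x80\x80\x80") rejects an old-style 5-byte form on its very first byte, without looking at the continuation bytes.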

Kind regards,


Received on Thu May 18 2017 - 02:56:05 CDT