Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Alastair Houghton via Unicode on 2017-05-18 (Unicode Mail List Archive)

From: Alastair Houghton via Unicode <unicode_at_unicode.org>
Date: Thu, 18 May 2017 08:55:49 +0100

On 18 May 2017, at 06:01, Richard Wordingham via Unicode <unicode_at_unicode.org> wrote:
>
> On Thu, 18 May 2017 02:04:55 +0200
> Philippe Verdy via Unicode <unicode_at_unicode.org> wrote:
>
>> I find it intriguing that the update intends to enforce the decoding
>> of the **shortest** sequences, but now wants to treat **maximal
>> sequences** as a single unit with arbitrary length. UTF-8 was
>> designed to work only with some state machines that would NEVER need
>> to parse more than 4 bytes.
>
> If you look at the sample code in
> http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that
> it's working with 6-byte sequences. It's the Unicode, as opposed to
> ISO 10646, version that has always been restricted to 4 bytes.

There are good reasons for restricting it to four-byte sequences, mind; doing so increases the number of invalid code units (byte values that can never appear in well-formed UTF-8), which makes it easier to tell UTF-8 from non-UTF-8 data. I don’t think anyone is proposing allowing 5-byte or 6-byte sequences.
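
To put rough numbers on that, here is a minimal Python sketch that counts the byte values that can never occur in well-formed input under each scheme. The function name never_valid_bytes and its last_lead parameter are made up for illustration. It assumes overlong forms are rejected in both cases (so 0xC0 and 0xC1 are never valid), that the four-byte form stops at U+10FFFF (highest lead byte 0xF4), and that the older six-byte form allows lead bytes up to 0xFD.

```python
def never_valid_bytes(last_lead):
    """Return the byte values that cannot appear anywhere in well-formed
    UTF-8 when the highest permitted lead byte is `last_lead`.

    Hypothetical helper for illustration only, not taken from any library.
    """
    valid = set(range(0x00, 0x80))            # ASCII, single-byte sequences
    valid |= set(range(0x80, 0xC0))           # continuation bytes 10xxxxxx
    valid |= set(range(0xC2, last_lead + 1))  # lead bytes; 0xC0/0xC1 are overlong-only
    return sorted(set(range(0x100)) - valid)


# Four-byte UTF-8 limited to U+10FFFF: the highest lead byte is 0xF4.
four_byte = never_valid_bytes(0xF4)

# Older six-byte form: lead bytes run up to 0xFD.
six_byte = never_valid_bytes(0xFD)

print(len(four_byte), [hex(b) for b in four_byte])
print(len(six_byte), [hex(b) for b in six_byte])
```

Under those assumptions the four-byte form leaves 13 byte values permanently invalid (0xC0, 0xC1 and 0xF5 through 0xFF), against only 4 for the six-byte form (0xC0, 0xC1, 0xFE, 0xFF), so a detector has more bytes that immediately rule out UTF-8.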

Kind regards,

Alastair.

--
http://alastairs-place.net

Received on Thu May 18 2017 - 02:56:05 CDT