Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Wed, 31 May 2017 21:06:29 +0100

On Wed, 31 May 2017 17:43:08 +0000
Shawn Steele via Unicode <unicode_at_unicode.org> wrote:

> There also appears to be a special weight given to
> non-minimally-encoded sequences. It would seem to me that none of
> these illegal sequences should appear in practice, so we have either:

<snip>

> I do not understand the energy being invested in a case that
> shouldn't happen, especially in a case that is a subset of all the
> other bad cases that could happen.

That's not the motivation for my using a structurally based approach.
I want to expend as little energy as possible, both in thought (Keep
It Simple, Stupid) and in machine cycles, in catering for these
overlong/non-scalar value cases. I have to cater for indisputably
illegal truncated sequences, but for the rest of it I optimise for the
conformant case. If I'm extracting scalar values, I calculate the
scalar value and then check that it's legal. If I'm advancing through a
string, I just advance by the requisite number of trailing bytes.
UTF-8 is simple in concept, and I try to follow that simplicity. A
state machine overcomplicates it.
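
A minimal sketch of such a structurally based extractor, in Python. All
names here (trailing_count, extract, advance, decode) are illustrative and
are not taken from this message or from any library. The sketch assumes
that a structurally complete but overlong or non-scalar sequence yields a
single U+FFFD, which is one of the behaviours under discussion in this
thread rather than a settled rule.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, returned for any ill-formed sequence

# Smallest value that legitimately needs 0, 1, 2 or 3 trailing bytes.
MIN_VALUE = (0x0, 0x80, 0x800, 0x10000)


def trailing_count(lead):
    """Trailing bytes announced by a lead byte, or -1 if it cannot lead."""
    if lead < 0x80:
        return 0
    if 0xC0 <= lead < 0xE0:
        return 1
    if 0xE0 <= lead < 0xF0:
        return 2
    if 0xF0 <= lead < 0xF8:
        return 3
    return -1  # stray trailing byte (0x80..0xBF) or 0xF8..0xFF


def extract(data, i):
    """Return (scalar value, index of next sequence) for data[i:]."""
    lead = data[i]
    n = trailing_count(lead)
    if n < 0:
        return REPLACEMENT, i + 1
    if n == 0:
        return lead, i + 1
    value = lead & (0x3F >> n)  # payload bits of the lead byte
    end = i + 1
    for _ in range(n):
        # Truncated sequence: stop at the byte that is not a trailing byte.
        if end >= len(data) or (data[end] & 0xC0) != 0x80:
            return REPLACEMENT, end
        value = (value << 6) | (data[end] & 0x3F)
        end += 1
    # Structure is fine; only now check that the value is legal.
    if value < MIN_VALUE[n] or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        return REPLACEMENT, end
    return value, end


def advance(data, i):
    """Skip one sequence by trusting the lead byte, without validating it."""
    n = trailing_count(data[i])
    return min(len(data), i + 1 + max(n, 0))


def decode(data):
    """Extract every scalar value from a bytes object."""
    out, i = [], 0
    while i < len(data):
        value, i = extract(data, i)
        out.append(value)
    return out
```

For example, decode(b"\xe2\x82A") treats the first two bytes as one
truncated sequence and then resumes at "A". The legality test is a single
comparison after the value has been computed, so the conformant case pays
almost nothing for it.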

Moreover, if I want to handle CESU-8 or U+0000 as opposed to a sentinel
null, it is easy to add special-case logic to a scalar value extractor.
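
A hedged sketch of what such special-case logic might look like, in
Python. It assumes that the U+0000 case refers to the two-byte form C0 80
(as used by Java's modified UTF-8) and that CESU-8 support means accepting
a pair of three-byte surrogate encodings. The names raw_extract, extract,
cesu8 and c0_80_is_nul are hypothetical, chosen only for illustration.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, returned for any ill-formed sequence
MIN_VALUE = (0x0, 0x80, 0x800, 0x10000)  # minimum value per trailing count


def raw_extract(data, i):
    """Structural decode only: (value or None, trailing count, next index)."""
    lead = data[i]
    if lead < 0x80:
        return lead, 0, i + 1
    if 0xC0 <= lead < 0xE0:
        n = 1
    elif 0xE0 <= lead < 0xF0:
        n = 2
    elif 0xF0 <= lead < 0xF8:
        n = 3
    else:
        return None, 0, i + 1  # byte that cannot start a sequence
    value = lead & (0x3F >> n)
    end = i + 1
    for _ in range(n):
        if end >= len(data) or (data[end] & 0xC0) != 0x80:
            return None, n, end  # truncated sequence
        value = (value << 6) | (data[end] & 0x3F)
        end += 1
    return value, n, end


def extract(data, i, cesu8=False, c0_80_is_nul=False):
    """Return (scalar value, next index), with optional special cases."""
    value, n, end = raw_extract(data, i)
    if value is None:
        return REPLACEMENT, end
    # Special case 1: accept C0 80 (and only that form) as U+0000.
    if c0_80_is_nul and n == 1 and value == 0:
        return 0, end
    # Special case 2: combine a CESU-8 surrogate pair into one scalar value.
    if cesu8 and n == 2 and 0xD800 <= value <= 0xDBFF and end < len(data):
        low, n2, end2 = raw_extract(data, end)
        if low is not None and n2 == 2 and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((value - 0xD800) << 10) + (low - 0xDC00), end2
    # Ordinary legality check on the computed value.
    if value < MIN_VALUE[n] or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        return REPLACEMENT, end
    return value, end
```

With both flags off this behaves as a plain UTF-8 extractor. Each special
case is a single test placed before the ordinary legality check, which is
the sense in which the extra logic is cheap to add.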

>
> -Shawn
>
Received on Wed May 31 2017 - 15:06:52 CDT
