Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Thu, 18 May 2017 19:03:09 +0100

On Thu, 18 May 2017 09:58:43 +0100
Alastair Houghton via Unicode <unicode_at_unicode.org> wrote:

> On 18 May 2017, at 07:18, Henri Sivonen via Unicode
> <unicode_at_unicode.org> wrote:
> >
> > the decision complicates U+FFFD generation when validating UTF-8 by
> > state machine.
>
> It *really* doesn’t. Even if you’re hell bent on using a pure state
> machine approach, you need to add maybe two additional error states
> (two-trailing-bytes-to-eat-then-fffd and
> one-trailing-byte-to-eat-then-fffd) on top of the states you already
> have. The implementation complexity argument is a *total* red
> herring.

For big programs, yes. However, for a small program it can be
attractive to have a small hand-coded routine so that the source code
can sit in a single file. A hand-coded decoder can even let a
basically UTF-8 program meet a requirement to match lone surrogates in
a regular expression, as was once required.
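
As an illustration only, here is a minimal sketch in Python of the kind of
small hand-coded routine meant above. It is not code from any poster or
library. The function name decode_utf8 and both flags are hypothetical.
By default it emits one U+FFFD per maximal subpart of an ill-formed
sequence. The eat_trailing flag is one reading of the two extra
"trailing-bytes-to-eat-then-fffd" error states in the quoted message: after
a valid lead byte with a bad continuation, it swallows the remaining
trailing bytes and emits a single U+FFFD. The allow_surrogates flag
lets ED A0..BF xx decode to a lone surrogate, which is an assumption about
how the regular-expression requirement might be met.

```python
REPLACEMENT = "\uFFFD"


def decode_utf8(data: bytes, allow_surrogates: bool = False,
                eat_trailing: bool = False) -> str:
    """Hand-coded UTF-8 decoder (illustrative sketch, hypothetical API)."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        # lo..hi is the allowed range for the *second* byte; it is narrower
        # for E0, ED, F0 and F4 to exclude non-shortest forms, surrogates
        # and values above U+10FFFF.
        lo, hi = 0x80, 0xBF
        if 0xC2 <= b <= 0xDF:
            need, cp = 1, b & 0x1F
        elif 0xE0 <= b <= 0xEF:
            need, cp = 2, b & 0x0F
            if b == 0xE0:
                lo = 0xA0
            elif b == 0xED and not allow_surrogates:
                hi = 0x9F
        elif 0xF0 <= b <= 0xF4:
            need, cp = 3, b & 0x07
            if b == 0xF0:
                lo = 0x90
            elif b == 0xF4:
                hi = 0x8F
        else:                             # stray trail byte, C0, C1, F5..FF
            out.append(REPLACEMENT)
            i += 1
            continue
        i += 1
        remaining = need
        ok = True
        while remaining > 0:
            if i >= n or not (lo <= data[i] <= hi):
                ok = False
                if eat_trailing:
                    # Assumed policy: swallow the trailing bytes the lead
                    # byte announced, so the sequence yields one U+FFFD.
                    while remaining > 0 and i < n and 0x80 <= data[i] <= 0xBF:
                        i += 1
                        remaining -= 1
                break
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            remaining -= 1
            lo, hi = 0x80, 0xBF           # later bytes use the plain range
        out.append(chr(cp) if ok else REPLACEMENT)
    return "".join(out)
```

With the defaults, the bytes E0 80 80 decode to three U+FFFD, one per byte.
With eat_trailing=True they decode to one. In this sketch C0 and C1 are
never treated as lead bytes, so C0 AF yields two U+FFFD under either
setting. Whether that matches the proposal under discussion is not
something this sketch settles.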

Richard.
Received on Thu May 18 2017 - 13:03:48 CDT
