Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Philippe Verdy via Unicode on 2017-05-16 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 12:44:00 +0200

>
> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable). However, I’m not entirely certain about things
> like
>
> e0 e0 c3 89
>
> which the proposal would appear to decode as
>
> U+FFFD U+FFFD U+FFFD U+FFFD (3)
>
> instead of a perhaps more reasonable
>
> U+FFFD U+FFFD U+00C9 (4)
>
> (the key part is the “without ever restricting trail bytes to less than
> 80..BF”)
>

I also agree with that, because strings may be accessed from a random
position: if you land on byte 0x89, you can assume it is a trailing byte, so
you will look backward and see 0xc3,0x89, which decodes correctly as U+00C9
without any error being detected.
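
As a rough illustration of that backward lookup, here is a minimal Python
sketch. The function name char_at and its return shape are hypothetical,
chosen only for this example; it leans on Python's strict "utf-8" codec to
decide whether a candidate sequence is well-formed.

```python
def char_at(data: bytes, i: int):
    """Hypothetical helper: find the character covering byte index i.

    Returns (start_index, text). If no well-formed sequence covers
    data[i], returns (i, U+FFFD) for that single byte.
    """
    start = i
    # Step back over trailing bytes (0x80..0xBF); a well-formed
    # sequence has at most three of them after its lead byte.
    while start > 0 and i - start < 3 and 0x80 <= data[start] <= 0xBF:
        start -= 1
    # Try the candidate lengths that would actually cover index i.
    for length in range(i - start + 1, 5):
        chunk = data[start:start + length]
        if len(chunk) < length:
            break  # ran off the end of the buffer
        try:
            return start, chunk.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            continue
    return i, "\ufffd"


data = bytes.fromhex("e0e0c389")
print(char_at(data, 3))  # lands on 0x89, looks back to 0xc3
print(char_at(data, 0))  # the first 0xe0 stands alone
```

Landing on index 3 finds the pair starting at index 2, while indexes 0 and 1
each stand alone as ill-formed bytes.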

So the only wrong bytes are the two initial occurrences of 0xe0, each of
which is individually converted to U+FFFD.

In summary: when you detect any ill-formed sequence, replace only its first
code unit by U+FFFD and restart scanning from the next code unit, without
skipping over multiple bytes.
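
The rule summarized above can be sketched in a few lines of Python. This is
only an illustration under my own assumptions, not text from the proposal:
the function name decode_one_per_unit is hypothetical, and well-formedness
is delegated to Python's strict "utf-8" codec.

```python
def decode_one_per_unit(data: bytes) -> str:
    """Hypothetical decoder: one U+FFFD per ill-formed code unit."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest well-formed sequence starting at i (1..4 bytes).
        for length in range(1, 5):
            try:
                out.append(data[i:i + length].decode("utf-8"))  # strict
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:
            # Nothing well-formed starts here: replace this one byte
            # only, then rescan from the very next byte.
            out.append("\ufffd")
            i += 1
    return "".join(out)


for sample in ("e0e0c389", "808080", "e0e0e0"):
    decoded = decode_one_per_unit(bytes.fromhex(sample))
    print(sample, [f"U+{ord(c):04X}" for c in decoded])
```

With this rule, e0 e0 c3 89 comes out as U+FFFD U+FFFD U+00C9, and both
80 80 80 and e0 e0 e0 come out as three U+FFFD.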

This means that emitting multiple occurrences of U+FFFD is not only the best
practice; it also matches the intended design of UTF-8, which allows access
from random positions.
Received on Tue May 16 2017 - 05:44:41 CDT
