Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Philippe Verdy via Unicode on 2017-05-26 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Fri, 26 May 2017 15:22:54 +0200

>
> Citing directly from the PRI:
>
> >>>>
> The term "maximal subpart of the ill-formed subsequence" refers to the
> longest potentially valid initial subsequence or, if none, then to the next
> single code unit.
> >>>>
>

The way I understand it, C0 80 will have TWO maximal subparts. No valid
sequence can start with C0, so there is no valid initial subsequence and
only the next single code unit (C0) is considered. The following byte 80
likewise starts no valid initial subsequence, so again only the next single
code unit (80) is considered. U+FFFD is therefore emitted twice. The same
reasoning handles all the "overlong" sequences that were allowed by the old
UTF-8 definition in the first RFC.

For E3 80 20, there will be only ONE maximal subpart because E3 80 is a
valid initial subsequence (of a three-byte sequence), so a single U+FFFD
replacement will be emitted, followed by the valid UTF-8 sequence (20),
which will correctly decode as U+0020.
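
To make the two cases above concrete, here is a minimal Python sketch of a
decoder that emits one U+FFFD per maximal subpart. It is only an
illustration under my own assumptions: the function names are hypothetical,
and the byte ranges are taken from the well-formed UTF-8 table of the
Unicode Standard (Table 3-7), not from the PRI text itself.

```python
REPLACEMENT = "\ufffd"

def _expected_length(lead):
    """Length of the sequence a lead byte announces, or 0 if it can start none."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # C0, C1, F5..FF and stray continuation bytes 80..BF

def _second_byte_range(lead):
    """Allowed range for the byte right after the lead (restricted for some leads)."""
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF

def decode_maximal_subparts(data):
    """Decode bytes as UTF-8, replacing each maximal subpart with one U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        need = _expected_length(lead)
        if need == 0:
            # No valid initial subsequence: consume exactly one code unit.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Extend the candidate as long as it stays a valid initial subsequence.
        j = i + 1
        lo, hi = _second_byte_range(lead)
        while j < i + need and j < n and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF
        if j == i + need:
            out.append(data[i:j].decode("utf-8"))  # complete, well-formed
        else:
            out.append(REPLACEMENT)  # truncated: one U+FFFD for the subpart
        i = j
    return "".join(out)
```

With this sketch, the bytes C0 80 give two U+FFFD, while E3 80 20 gives one
U+FFFD followed by U+0020. A complete two-byte sequence such as C3 80 is
decoded normally, as U+00C0.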

Good! This means that the proposal makes sense and is compatible with
random access within the encoded text: we never have to look backward over
an unbounded number of code units, and we never have to handle a case where
a possibly unbounded number of code units is mapped to the same U+FFFD
replacement.
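
As a rough illustration of the bounded look-back, here is a small Python
sketch (the function names are my own, hypothetical ones). It assumes the
maximal-subpart policy, under which every decoded unit is at most four bytes
long and only its first byte may lie outside 80..BF. From an arbitrary
offset it is then enough to inspect at most three preceding bytes to find a
position where decoding can restart.

```python
def is_continuation(b):
    """True for bytes in 80..BF, the only bytes that can follow a lead byte."""
    return 0x80 <= b <= 0xBF

def resync_start(data, pos):
    """Return an offset <= pos, at most 3 bytes back, where a unit starts.

    A byte outside 80..BF always starts a unit (a well-formed sequence or a
    maximal subpart). If pos and the three bytes before it are all
    continuation bytes, no unit begun earlier can reach pos, so pos itself
    starts a unit (a stray continuation byte).
    """
    for back in range(4):
        k = pos - back
        if k < 0:
            break  # ran past the start: only stray continuation bytes so far
        if not is_continuation(data[k]):
            return k
    return pos
```

For example, in the bytes 61 E3 80 20, an access at offset 2 (the byte 80)
steps back one byte to the E3 at offset 1, and decoding can restart there.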
Received on Fri May 26 2017 - 08:23:52 CDT
