Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Doug Ewell via Unicode <unicode_at_unicode.org>
Date: Wed, 17 May 2017 19:48:59 -0600

Richard Wordingham wrote:

>> I'm afraid I don't get the analogy.
>
> You can't build a full Unicode system out of Unicode-compliant parts.

Others will have to address Richard's point about canonical-equivalent
sequences.

> However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
> (in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
> critical wording, "When converting from UTF-8 to Unicode values,
> however, implementations do not need to check that the shortest
> encoding is being used,...". There was no prohibition on
> implementations performing the check, so whether C0 80 would be
> interpreted as U+0000 or as an error was unpredictable.

So it is as I said, and as TUS said before Corrigendum #1 was approved,
more than 16 years ago: It was not legal to create overlong sequences,
but implementations were allowed to interpret any that they came across.

As someone who pays attention to the fine details, you will certainly
appreciate the difference between "it was once legal to encode NUL as E0
80 80" and "it was once legal for a decoder to interpret the sequence E0
80 80 as NUL instead of rejecting it."
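
To make that distinction concrete, here is a minimal sketch in Python.
The function names are hypothetical and the code is not taken from any
version of the standard; it only illustrates the two decoder behaviors
under discussion. decode_lenient() omits the shortest-form check, as
the Unicode 2.0 wording permitted, while decode_strict() leans on
Python's built-in UTF-8 codec, which rejects overlong sequences.

```python
def decode_lenient(data: bytes) -> list:
    """Decode UTF-8-style byte sequences WITHOUT a shortest-form check.

    Illustrative only: it models a decoder that interprets overlong
    sequences, and it makes no attempt to reject surrogates or values
    above U+10FFFF.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # single byte
            n, cp = 0, lead
        elif 0xC0 <= lead < 0xE0:       # lead byte of a 2-byte form
            n, cp = 1, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:       # lead byte of a 3-byte form
            n, cp = 2, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:       # lead byte of a 4-byte form
            n, cp = 3, lead & 0x07
        else:                           # stray trail byte or F8..FF
            raise ValueError("invalid lead byte at offset %d" % i)
        trail = data[i + 1:i + 1 + n]
        if len(trail) != n or any((t & 0xC0) != 0x80 for t in trail):
            raise ValueError("bad trail bytes at offset %d" % i)
        for t in trail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(cp)                  # no check that cp needed n+1 bytes
        i += 1 + n
    return out


def decode_strict(data: bytes) -> list:
    """Decode with Python's built-in codec, which rejects overlong forms."""
    return [ord(ch) for ch in data.decode("utf-8")]
```

With these definitions, decode_lenient(b"\xe0\x80\x80") and
decode_lenient(b"\xc0\x80") both yield a single U+0000, whereas
decode_strict() raises UnicodeDecodeError on either input. Neither
function says anything about whether an encoder was ever allowed to
produce those bytes.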

--
Doug Ewell | Thornton, CO, US | ewellic.org 
Received on Wed May 17 2017 - 20:50:07 CDT
