Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Martin J. Dürst via Unicode <unicode_at_unicode.org>
Date: Tue, 30 May 2017 20:26:39 +0900

Hello Karl, others,

On 2017/05/27 06:15, Karl Williamson via Unicode wrote:
> On 05/26/2017 12:22 PM, Ken Whistler wrote:
>>
>> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>>> The link provided about the PRI doesn't lead to the comments.
>>>
>>
>> PRI #121 (August, 2008) pre-dated the practice of keeping all the
>> feedback comments together with the PRI itself in a numbered directory
>> with the name "feedback.html". But the comments were collected
>> together at the time and are accessible here:
>>
>> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
>>
>> Also there was a separately submitted comment document:
>>
>> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
>>
>> And the minutes of the pertinent UTC meeting (UTC #116):
>>
>> http://www.unicode.org/L2/L2008/08253.htm
>>
>> The minutes simply capture the consensus to adopt Option #2 from PRI
>> #121, and the relevant action items.
>>
>> I now return the floor to the distinguished disputants to continue
>> litigating history. ;-)
>>
>> --Ken
>>
>>
>
> The reason this discussion got started was that in December, someone
> came to me and said the code I support does not follow Unicode best
> practices, and suggested I need to change, though no ticket (yet) has
> been filed. I was surprised, and posted a query to this list about what
> the advantages of the new approach are.

Can you provide a reference to that discussion? I might have missed it
in December.

> There were a number of replies,
> but I did not see anything that seemed definitive. After a month, I
> created a ticket in Unicode and Markus was assigned to research it, and
> came up with the proposal currently being debated.

Which is to completely reverse the current recommendation in Unicode
9.0. While I agree that this might help you fend off a bug report, it
would create chances for bug reports for Ruby, Python3, many if not all
Web browsers,...

> Looking at the PRI, it seems to me that treating an overlong as a single
> maximal unit is in the spirit of the wording, if not the fine print.

In standards, the "fine print" matters.

> That seems to be borne out by Markus, even with his stake in ICU,
> supporting option #2.

Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I
also supported option 2, with code behind it.

> Looking at the comments, I don't see any discussion of the effect of
> this on overlong treatments. My guess is that the change in effect was
> unintentional.

I agree that it was probably not considered explicitly. But overlongs
were disallowed for security reasons, and once the definition of UTF-8
was tightened, "overlongs" essentially did not exist anymore.
Essentially, "overlong" is a word like "dragon" or "ghost": Everybody
knows what it means, but everybody knows they don't exist.

[Just to be sure, by the above, I don't mean that a sequence such as
C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by
itself, and there is no need to see C0 B0 as a (ghost) sequence.]
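To make the C0 B0 case concrete, here is a minimal sketch. It assumes CPython 3's built-in UTF-8 decoder with errors="replace", which, as far as I know, follows the Unicode 9.0 recommendation; the sample labels and the helper name count_replacements are made up for illustration.

```python
# Assumption: CPython 3's built-in decoder, errors="replace".
# Each sample is an ill-formed byte sequence discussed in this thread.
samples = {
    "C0 B0 (would-be overlong of U+0030)": b"\xc0\xb0",
    "E0 80 80 (would-be overlong of U+0000)": b"\xe0\x80\x80",
    "E2 82 (truncated U+20AC)": b"\xe2\x82",
}


def count_replacements(raw: bytes) -> int:
    """Count how many U+FFFD the decoder emits for the given bytes."""
    return raw.decode("utf-8", errors="replace").count("\ufffd")


for label, raw in samples.items():
    print(label, "->", count_replacements(raw), "x U+FFFD")
```

On such a decoder, C0 and B0 are each replaced separately, because C0 can never start a well-formed sequence, while the truncated E2 82 is a valid prefix and is replaced as one unit.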

> So I have code that handled overlongs in the only correct way possible
> when they were acceptable,

No. As long as they were acceptable, they wouldn't have been replaced by
an FFFD.

> and in the obvious way after they became illegal,

Why? A change was necessary from producing an actual character to
producing some number of FFFDs. It may have been easier to produce just
a single FFFD, but that depends on how the code was organized.
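How the number of FFFDs falls out of the code organization can be sketched as follows. This is only an illustration under stated assumptions, not anybody's actual implementation: decode_maximal_subpart checks each byte against the well-formed ranges as it goes (the current recommendation), while decode_structural first collects a lead byte plus its continuation bytes and only then checks the value, which naturally yields a single FFFD for an overlong. Both function names are hypothetical.

```python
REPLACEMENT = "\ufffd"


def _second_byte_range(lead):
    """Allowed range of the second byte for a lead byte (well-formed UTF-8)."""
    return {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
            0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}.get(lead, (0x80, 0xBF))


def decode_maximal_subpart(data):
    """One U+FFFD per maximal subpart of an ill-formed sequence."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        if 0xC2 <= b <= 0xDF: need = 1
        elif 0xE0 <= b <= 0xEF: need = 2
        elif 0xF0 <= b <= 0xF4: need = 3
        else:
            # C0, C1, F5..FF and stray continuation bytes never start a sequence.
            out.append(REPLACEMENT); i += 1; continue
        cp = b & (0x3F >> need)          # payload bits of the lead byte
        j = i + 1
        for k in range(need):
            lo, hi = _second_byte_range(b) if k == 0 else (0x80, 0xBF)
            if j >= n or not (lo <= data[j] <= hi):
                break                    # stop at the first byte that cannot continue
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        else:
            out.append(chr(cp)); i = j; continue
        out.append(REPLACEMENT)          # data[i:j] was a valid prefix only
        i = j
    return "".join(out)


def decode_structural(data):
    """Collect lead + continuation bytes first, validate the value afterwards."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        if 0xC0 <= b <= 0xDF: need, minimum = 1, 0x80
        elif 0xE0 <= b <= 0xEF: need, minimum = 2, 0x800
        elif 0xF0 <= b <= 0xF7: need, minimum = 3, 0x10000
        else:
            out.append(REPLACEMENT); i += 1; continue
        cp = b & (0x3F >> need)
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i == need + 1)
        valid = (complete and minimum <= cp <= 0x10FFFF
                 and not 0xD800 <= cp <= 0xDFFF)
        # An overlong, surrogate or out-of-range value becomes a single U+FFFD.
        out.append(chr(cp) if valid else REPLACEMENT)
        i = j
    return "".join(out)


for raw in (b"\xc0\xb0", b"\xe0\x80\x80", b"\xe2\x82"):
    print(raw, decode_maximal_subpart(raw).count(REPLACEMENT),
          decode_structural(raw).count(REPLACEMENT))
```

Under these assumptions, C0 B0 gives two FFFDs from the first function and one from the second, and the two agree on well-formed input and on truncated sequences such as E2 82.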

> and now without apparent discussion (which is very much akin to
> "flimsy reasons"), it suddenly was no longer "best practice".

Not 'now', but almost 9 years ago. And not "without apparent
discussion", but with an explicit PRI.

> And that
> change came "rather late in the game". That this escaped notice for
> years indicates that the specifics of REPLACEMENT CHAR handling don't
> matter all that much.

I agree. You haven't even received a ticket yet.

> To cut to the chase, I think Unicode should issue a Corrigendum to the
> effect that it was never the intent of this change to say that treating
> overlongs as a single unit isn't best practice. I'm not sure this
> warrants a full-fledged Corrigendum, though. But I believe the text of
> the best practices should indicate that treating overlongs as a single
> unit is just as acceptable as Martin's interpretation.

I'd essentially be fine with that, under the condition that the current
recommendation is maintained as a clearly identified recommendation, so
that Python3, Ruby, Web standards and browsers, and so on can easily
refer to it.

Regards, Martin.

> I believe this is pretty much in line with Shawn's position. Certainly,
> a discussion of the reasons one might choose one interpretation over
> another should be included in TUS. That would likely have satisfied my
> original query, which hence would never have been posted.
Received on Tue May 30 2017 - 06:27:07 CDT
