Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Martin J. Dürst via Unicode <unicode_at_unicode.org>
Date: Fri, 26 May 2017 19:28:36 +0900

On 2017/05/25 09:22, Markus Scherer wrote:
> On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <public_at_khwilliamson.com>
> wrote:
>
>> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>>
>>> That's wrong. There was a public review issue with various options and
>>> with feedback, and the recommendation has been implemented and in use
>>> widely (among others, in major programming languages and browsers)
>>> without
>>> problems for quite some time.
>>>
>>
>> Could you supply a reference to the PRI and its feedback?
>>
>
> http://www.unicode.org/review/resolved-pri-100.html#pri121
>
> The PRI did not discuss possible different versions of "maximal subpart",
> and the examples there yield the same results either way. (No non-shortest
> forms.)

It is correct that it didn't give any of the *examples* that are under
discussion now. On the other hand, the PRI is very clear about what it
means by "maximal subpart":

Citing directly from the PRI:

>>>>
The term "maximal subpart of the ill-formed subsequence" refers to the
longest potentially valid initial subsequence or, if none, then to the
next single code unit.
>>>>
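
To make the two branches of that definition concrete, here is a small
illustration of my own (not from the PRI); the exact outputs assume a
reasonably recent Python 3, whose built-in decoder follows this
recommendation:

    # Branch 1: <E1 80> is the longest potentially valid initial
    # subsequence (it merely lacks its final continuation byte),
    # so it is replaced by a single U+FFFD.
    print(b'\xe1\x80\x41'.decode('utf-8', 'replace'))  # -> '\ufffdA'

    # Branch 2: a lone continuation byte cannot start anything
    # potentially valid, so the single code unit is replaced.
    print(b'\x80\x41'.decode('utf-8', 'replace'))      # -> '\ufffdA'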

At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier: RFC 3629
(https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect
the tightening of the UTF-8 definition in Unicode/ISO 10646.

>> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
>> ill-formed subsequence by a single U+FFFD."
>>
>
> You are right.
>
> http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
> expanded example compared with the PRI.
>
> The text simply talked about a "conversion process" stopping as soon as it
> encounters something that does not fit, so these edge cases would depend on
> whether the conversion process treats original-UTF-8 sequences as single
> units.

No, the text, both in the PRI and in Unicode 5.2, is quite clear. The
"does not fit" (a phrase I haven't found in either text) is clearly
grounded in "ill-formed UTF-8". And there's no question about what
"ill-formed UTF-8" means, in particular in Unicode 5.2, where you just
have to go two pages back to find byte sequences such as <C0 AF> and
<E0 9F 80> called out explicitly as ill-formed.
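
As a quick check, here is how a decoder that follows the
recommendation handles those two sequences; this is my own
illustration, and the outputs again assume a reasonably recent
Python 3:

    # C0 can never begin a well-formed sequence, and a stray AF cannot
    # either, so <C0 AF> yields two replacement characters.
    print(b'\xc0\xaf'.decode('utf-8', 'replace') == '\ufffd' * 2)      # True

    # E0 requires a second byte in A0..BF, so <E0> alone is the maximal
    # subpart; 9F and 80 are then two more single-unit subparts.
    print(b'\xe0\x9f\x80'.decode('utf-8', 'replace') == '\ufffd' * 3)  # True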

Any claim, as in the L2/17-168 document, that there is an option 2a
is simply not substantiated. It's true that there are no explicit
examples in the PRI that would allow one to distinguish between
converting e.g.
FC BF BF BF BF 80
to a single FFFD or to six of them. But there's no need to have
examples for every corner case if the text is clear enough. In the
above six-byte sequence, there's not a single potentially valid
(initial) subsequence, so each byte is a maximal subpart of its own.

>> And I agree with that. And I view an overlong sequence as a maximal
>> ill-formed subsequence

Can you point to any definition that would include or allow such an
interpretation? I just haven't found one yet, either in the PRI or in
Unicode 5.2.

>> that should be replaced by a single FFFD. There's
>> nothing in the text of 5.2 that immediately follows that recommendation
>> that indicates to me that my view is incorrect.

I have to agree that the text in Unicode 5.2 could be clearer. It's a
hodgepodge of attempts at justifications and definitions. And the word
"maximal" itself may also contribute to pushing the interpretation in
one direction.

But there's plenty in the text that makes it absolutely clear that some
things cannot be included. In particular, it says

>>>>
The term “maximal subpart of an ill-formed subsequence” refers to the
code units that were collected in this manner. They could be the start
of a well-formed sequence, except that the sequence lacks the proper
continuation. Alternatively, the converter may have found a
continuation code unit, which cannot be the start of a well-formed sequence.
>>>>

And the "in this manner" refers to:
>>>>
A sequence of code units will be processed up to the point where the
sequence either can be unambiguously interpreted as a particular Unicode
code point or where the converter recognizes that the code units
collected so far constitute an ill-formed subsequence.
>>>>

So we have the same thing twice: Bail out as soon as something is
ill-formed.
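
For what it's worth, the whole rule fits in a few lines of code. Here
is a minimal sketch of my own (the function name and the transcription
of the Table 3-7 lead-byte ranges are mine; this is an illustration,
not normative text) of a converter that bails out exactly as
described, emitting one U+FFFD per maximal subpart:

    def utf8_fffd(data: bytes) -> str:
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                        # ASCII
                out.append(chr(b)); i += 1; continue
            # Lead byte -> (number of continuation bytes, valid range of
            # the *second* byte); bytes 3 and 4 are always 80..BF.
            if   0xC2 <= b <= 0xDF: need, lo, hi = 1, 0x80, 0xBF
            elif b == 0xE0:         need, lo, hi = 2, 0xA0, 0xBF  # no overlongs
            elif 0xE1 <= b <= 0xEC: need, lo, hi = 2, 0x80, 0xBF
            elif b == 0xED:         need, lo, hi = 2, 0x80, 0x9F  # no surrogates
            elif 0xEE <= b <= 0xEF: need, lo, hi = 2, 0x80, 0xBF
            elif b == 0xF0:         need, lo, hi = 3, 0x90, 0xBF  # no overlongs
            elif 0xF1 <= b <= 0xF3: need, lo, hi = 3, 0x80, 0xBF
            elif b == 0xF4:         need, lo, hi = 3, 0x80, 0x8F  # <= U+10FFFF
            else:                               # C0, C1, F5..FF, lone 80..BF
                out.append('\uFFFD'); i += 1; continue
            cp, j = b & (0x7F >> (need + 1)), i + 1
            for k in range(need):
                good = (lo, hi) if k == 0 else (0x80, 0xBF)
                if j >= n or not good[0] <= data[j] <= good[1]:
                    out.append('\uFFFD')        # one U+FFFD for the maximal
                    break                       # subpart data[i:j]
                cp, j = (cp << 6) | (data[j] & 0x3F), j + 1
            else:
                out.append(chr(cp))             # a well-formed sequence
            i = j                               # resume at the offending byte
        return ''.join(out)

With this, <C0 AF> yields two U+FFFDs, <E0 9F 80> three, and
<FC BF BF BF BF 80> six, matching the examples above.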

>> Perhaps my view is colored by the fact that I now maintain code that was
>> written to parse UTF-8 back when overlongs were still considered legal
>> input.

Thanks for providing this information. That's a lot more useful than
"feels right", which was given as a reason on this list before.

>> An overlong was a single unit. When they became illegal, the code
>> still considered them a single unit.

That's fine for your code. I might do the same (or not) if I were
you, because one never knows in what situations some code is used, and
what repercussions a change might produce.

But the PRI, and the wording in Unicode 5.2, were created when
overlongs, 5-byte and 6-byte sequences, surrogates, and so on were
already very clearly ill-formed. If these texts had intended to make
an exception for any of these cases, that exception would have had to
be written into the actual text.

Saying something like "the text isn't clear, because it says
ill-formed, but maybe it means ill-formed not as of the time it was
written but as of quite a few years before" just doesn't make sense to
me at all.

>> I can understand how someone who comes along later could say C0 can't be
>> followed by any continuation character that doesn't yield an overlong,
>> therefore C0 is a maximal subsequence.

Yes indeed, because maximal subsequences are defined by reference to
well-formed/ill-formed subsequences, and what's ill-formed is defined in
the same standard at the same time.

There's nobody "coming along later". That kind of wording would be
appropriate if the PRI and the recommendation in the standard had been
made e.g. in the 1990s, before the tightening of the UTF-8 definition.
Then somebody could say that Unicode had overlooked that, by changing
the definition of well-formed UTF-8, it implicitly changed the
recommendation for how to generate U+FFFDs.

But no such thing at all happened. The PRI was evaluated, and the
recommendation included in the text of Unicode, in the context of the
then-existing (and since then unchanged) definition of UTF-8.

>> But I assert that my interpretation is just as valid as that one.

Sorry, but it cannot be valid, because of the timing. The tightening of
the UTF-8 definition happened years before the PRI.

>> And perhaps more so, because of historical precedent.

It's good to know that there are older implementations that behave
differently. And I understand that in some cases their maintainers
might be reluctant to change them. My comments, and Henri's, are very
much motivated by the fact that we are reluctant to change our own
implementations.

It may be worth thinking about whether the Unicode standard should
mention implementations like yours. But there should be no doubt that
the PRI and Unicode 5.2 (and the current version of Unicode) are clear
about what they recommend, and that that recommendation is based on
the definition of UTF-8 in force at that time (and unchanged since),
not on a historical definition of UTF-8.

Regards, Martin.