Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Markus Scherer via Unicode <unicode_at_unicode.org>
Date: Sat, 3 Jun 2017 21:09:01 -0700

On Wed, May 31, 2017 at 5:12 AM, Henri Sivonen <hsivonen_at_hsivonen.fi> wrote:

> On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode
> <unicode_at_unicode.org> wrote:
> > There is plenty of time for public comment, since it was targeted at
> > Unicode 11, the release for about a year from now, not Unicode 10, due
> > this year. When the UTC "approves a change", that change is subject to
> > comment, and the UTC can always reverse or modify its approval up until
> > the meeting before release date. So there are ca. 9 months in which to
> > comment.
>
> What should I read to learn how to formulate an appeal correctly?
>

I suggest you submit a write-up via http://www.unicode.org/reporting.html
and make the case there that you think the UTC should retract
http://www.unicode.org/L2/L2017/17103.htm#151-C19:

B.13.3.3 Illegal UTF-8 [Scherer, L2/17-168
<http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/17-168>]

[151-C19 <http://www.unicode.org/cgi-bin/GetL2Ref.pl?151-C19>] Consensus:
Modify the section on "Best Practices for Using FFFD" in section "3.9
Encoding Forms" of TUS per the recommendation in L2/17-168, for Unicode
version 11.0.

> Does it matter if a proposal/appeal is submitted as a non-member
> implementor person, as an individual person member or as a liaison
> member?

The reporting.html form exists for gathering feedback from the public. The
UTC regularly reviews and considers such feedback in its quarterly meetings.

Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU
ticket via http://bugs.icu-project.org/trac/newticket and make the case
there, too, that you think (assuming you do) that ICU should change its
handling of illegal UTF-8 sequences.

> > If people really believed that the guidelines in that section should have
> > been conformance clauses, they should have proposed that at some point.
>
> It seems to me that this thread does not support the conclusion that
> the Unicode Standard's expression of preference for the number of
> REPLACEMENT CHARACTERs should be made into a conformance requirement
> in the Unicode Standard. This thread could be taken to support a
> conclusion that the Unicode Standard should not express any preference
> beyond "at least one and at most as many as there were bytes".
>

Given the discussion and controversy here, in my opinion, the standard
should probably tone down the "best practice" and "recommendation" language.
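
To make the trade-off concrete, consider the ill-formed sequence
<F0 80 80> (my own worked example, not one from the proposal). Under the
current "maximal subpart" best practice, <F0 80> is not a prefix of any
valid sequence (F0 requires a second byte in 90..BF), so each of the
three bytes is a separate error:

    F0 80 80  ->  U+FFFD U+FFFD U+FFFD  (maximal subparts)
    F0 80 80  ->  U+FFFD                (structural: F0 announces three
                                         trail bytes, one error)

Both results satisfy "at least one and at most as many as there were
bytes".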

> > Aside from UTF-8 history, there is a reason for preferring a more
> > "structural" definition for UTF-8 over one purely along valid sequences.
> > This applies to code that *works* on UTF-8 strings rather than just
> > converting them. For UTF-8 *processing* you need to be able to iterate
> > both forward and backward, and sometimes you need not collect code
> > points while skipping over n units in either direction -- but your
> > iteration needs to be consistent in all cases. This is easier to
> > implement (especially in fast, short, inline code) if you have to
> > look only at how many trail bytes follow a lead byte, without having
> > to look whether the first trail byte is in a certain range for some
> > specific lead bytes.
>
> But the matter at hand is decoding potentially-invalid UTF-8 input
> into a valid in-memory Unicode representation, so later processing is
> somewhat a red herring as being out of scope for this step. I do agree
> that if you already know that the data is valid UTF-8, it makes sense
> to work from the bit pattern definition only.

No, it's not a red herring. Not every piece of software has a neat "inside"
with all valid text, and with a controllable surface to the "outside".

In a large project with a small surface for text to enter the system, such
as a browser with a centralized chunk of code for handling streams of input
text, it might well work to validate once and then assume "on the inside"
that you only ever see well-formed text.

In a library with APIs at the granularity of "compare two strings",
"uppercase a string" or "normalize a string", you have no control over
your input; you cannot assume that your input is valid; you cannot crash
when it's not valid; you cannot overrun your buffer; and you cannot go
into an endless loop. It's also cumbersome to fail with an error whenever
you encounter invalid text, because you need more code for error
detection and handling, and because significant C++ code bases do not
allow exceptions.
(Besides, ICU also offers C APIs.)

Processing potentially-invalid UTF-8, iterating over it, and looking up
data for it, *can* definitely be simpler (take less code etc.) if for any
given lead byte you always collect the same maximum number of trail bytes,
and if you have fewer distinct types of lead bytes with their corresponding
sequences.
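
As a concrete illustration, here is a minimal sketch in C of such a
"structural" forward skip -- my own example under stated assumptions
(a raw byte buffer, with code point collection handled separately by
the caller), not ICU's actual implementation:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only, not ICU code: advance past one UTF-8 sequence,
     * deriving the trail-byte count from the lead byte alone. The
     * first trail byte is never range-checked against lead-specific
     * bounds. The function always advances by at least one byte and
     * never reads past len, so it cannot overrun a buffer or loop
     * forever on invalid input. */
    static size_t skip_sequence(const uint8_t *s, size_t i, size_t len) {
        uint8_t lead = s[i++];
        size_t trail = 0;
        if (lead >= 0xC0) {
            /* C0..DF -> 1 trail byte, E0..EF -> 2, F0..FF -> 3 */
            trail = 1 + (lead >= 0xE0) + (lead >= 0xF0);
        }
        while (trail > 0 && i < len && (s[i] & 0xC0) == 0x80) {
            --trail;
            ++i;
        }
        return i;  /* index of the next lead (or stray) byte */
    }

Backward iteration is symmetrical under the same rule: scan back over at
most three bytes of the form 10xxxxxx, then check whether the byte
reached announces that many trail bytes.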

Best regards,
markus