Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Asmus Freytag via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 00:22:53 -0700

On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
> <unicode_at_unicode.org> wrote:
>> I’m not sure how the discussion of “which is better” relates to the
>> discussion of ill-formed UTF-8 at all.
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, because UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraints permit)
> that *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
There are cases where it is prohibitively expensive to transcode external
data from UTF-8 to any other format as a precondition to doing any work.
In these situations processing has to be done in UTF-8, effectively making
that the in-memory representation. I've encountered this issue on separate
occasions, both in my own code and in code I reviewed for clients.
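For concreteness, a minimal sketch of working on external UTF-8 data in
place, in Rust (whose native strings are UTF-8 by definition; the function
name and the predicate here are hypothetical, not from any real codebase):

    // Hypothetical sketch: process externally supplied UTF-8 in place,
    // without first transcoding it to UTF-16 (or anything else).
    fn count_non_ascii(input: &[u8]) -> Result<usize, std::str::Utf8Error> {
        // Validation is a single pass over the bytes; no copy is made.
        let text: &str = std::str::from_utf8(input)?;
        // Iterate Unicode scalar values directly over the UTF-8 bytes.
        Ok(text.chars().filter(|c| !c.is_ascii()).count())
    }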

I therefore think Henri has a point when he is concerned about tacit
assumptions favoring one memory representation over another, but the way
he raises this point is needlessly antagonistic.
> ....At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.

This is a key point. It may not be directly relevant to any other
modifications to the standard, but the larger point is not to make
assumptions about how people implement the standard (or any of its
algorithms).
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
I would like to second this as well.

The level of documented review of existing implementation practices
tends to be thin (at least thinner than should be required for changing
long-established edge cases or recommendations, let alone core
conformance requirements).
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
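For concreteness, a minimal sketch of the behavior under the current
recommendation (one U+FFFD per maximal subpart of an ill-formed
subsequence), which Rust's lossy decoder follows; the byte sequence below
is just an example, not taken from Henri's test page:

    // E0 80 80: E0 requires a continuation byte in A0..BF, so each of
    // the three bytes is its own maximal subpart under the current
    // recommendation -> three U+FFFDs.
    fn main() {
        let bogus = [0xE0u8, 0x80, 0x80];
        let decoded = String::from_utf8_lossy(&bogus);
        let count = decoded.chars().filter(|&c| c == '\u{FFFD}').count();
        println!("{} U+FFFDs", count); // prints "3 U+FFFDs"
        // The proposal under discussion would instead treat E0 80 80 as
        // a single ill-formed sequence and emit one U+FFFD.
    }
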
It would be good if the UTC could work out some minimal requirements for
evaluating proposals for changes to properties and algorithms, much like
the criteria for encoding new code points.
A./