Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Martin J. Dürst via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 20:08:52 +0900

Hello everybody,

[using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag <asmusf_at_ix.netcom.com> wrote:

>> Under what circumstance would it matter how many U+FFFDs you see?
>
> Maybe it doesn't, but I don't think the burden of proof should be on
> the person advocating keeping the spec and major implementations as
> they are. If anything, I think those arguing for a change of the spec
> in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
> with the current spec should show why it's important to have a
> different number of U+FFFDs than the spec's "best practice" calls for
> now.

I have just checked (the programming language) Ruby. Some background:

As you might know, Ruby is (at least in theory) pretty
encoding-independent, meaning you can run scripts in iso-8859-1, in
Shift_JIS, in UTF-8, or in any of quite a few other encodings directly,
without any conversion.

However, in practice, including in Ruby on Rails, Ruby very much uses
UTF-8 internally and is optimized to work well that way. Character
encoding conversion also works with UTF-8 as the pivot encoding.

As far as I understand, Ruby does the same as all of the above software,
based (among other things) on the fact that we followed the
recommendation in the standard. Here are a few examples:

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD"

$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect'
#=> "A\uFFFD\uFFFDA\uFFFDA"
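
Incidentally (my own aside, not something from the tests), Ruby's
String#scrub, which repairs a string in place rather than transcoding
it, appears to count ill-formed subparts the same way, at least in the
cases I spot-checked. Passing "?" as the replacement makes the count of
subparts easy to see:

```ruby
# Each of <E0 80 80> is ill-formed on its own (E0 80 is not a valid
# prefix), so scrub emits three replacements, like encode above.
"\xe0\x80\x80".scrub("?")              #=> "???"

# FD can never start a sequence, and 81..85 are lone continuation
# bytes, so each of the six bytes gets its own replacement.
"\xfd\x81\x82\x83\x84\x85".scrub("?")  #=> "??????"

# With the default replacement, a lone continuation byte after valid
# text becomes a single U+FFFD (the example from the scrub docs).
"abc\u3042\x81".scrub                  #=> "abc\u3042\uFFFD"
```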

This is based on http://www.unicode.org/review/pr-121.html as noted at
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup#l1507
(For those having a look at these tests: in Ruby's version of
assert_equal, the expected value comes first. Not sure whether that
should be called little-endian or big-endian :-), but it is a decision
on which the various test frameworks are split roughly 50/50 :-(.)

Even though the above examples and the tests use conversion to UTF-16
(in particular the BE variant, for better readability), what happens
internally is that the input is analyzed byte by byte. In this case, it
is easiest to just stop as soon as something is found that is clearly
invalid (be this a single byte or something longer). This makes a
data-driven implementation (such as the Ruby transcoder) or one based on
a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/)
more compact.

In other words, because we never know whether the next byte is a valid
one such as 0x41, it is easiest to just handle one byte at a time; that
way we avoid lookahead, which is always a good idea when parsing.
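
To make the byte-at-a-time approach concrete, here is a minimal sketch
of my own in Ruby (an illustration, not Ruby's actual transcoder). The
lead-byte classes and second-byte ranges follow Table 3-7 of the Unicode
Standard; each maximal valid prefix of an ill-formed sequence becomes a
single U+FFFD, and decoding resumes at the offending byte.

```ruby
REPLACEMENT = "\u{FFFD}"

# Decode raw bytes as UTF-8, replacing each maximal subpart of an
# ill-formed subsequence with one U+FFFD, per the recommendation
# discussed in this thread.
def decode_utf8_with_replacement(bytes)
  out = +""
  i = 0
  n = bytes.bytesize
  while i < n
    b = bytes.getbyte(i)
    if b < 0x80                        # ASCII byte: pass through
      out << b.chr
      i += 1
      next
    end
    len, lo, hi =
      case b                           # Table 3-7 lead-byte classes
      when 0xC2..0xDF then [2, 0x80, 0xBF]
      when 0xE0       then [3, 0xA0, 0xBF]  # excludes overlong forms
      when 0xE1..0xEC then [3, 0x80, 0xBF]
      when 0xED       then [3, 0x80, 0x9F]  # excludes surrogates
      when 0xEE..0xEF then [3, 0x80, 0xBF]
      when 0xF0       then [4, 0x90, 0xBF]  # excludes overlong forms
      when 0xF1..0xF3 then [4, 0x80, 0xBF]
      when 0xF4       then [4, 0x80, 0x8F]  # excludes > U+10FFFF
      end
    if len.nil?                        # 0x80..0xBF, 0xC0, 0xC1, 0xF5..0xFF
      out << REPLACEMENT               # never valid: one U+FFFD per byte
      i += 1
      next
    end
    j = i + 1                          # scan the continuation bytes
    ok = true
    while j < i + len && j < n
      c = bytes.getbyte(j)
      ok = (j == i + 1) ? c.between?(lo, hi) : c.between?(0x80, 0xBF)
      break unless ok
      j += 1
    end
    if ok && j == i + len              # complete, well-formed sequence
      out << bytes.byteslice(i, len).force_encoding("UTF-8")
      i += len
    else                               # truncated or interrupted sequence:
      out << REPLACEMENT               # one U+FFFD for the whole valid prefix,
      i = j                            # then resume at the offending byte
    end
  end
  out
end
```

On the examples above, decode_utf8_with_replacement produces exactly the
same number of U+FFFDs as Ruby's transcoder (e.g. "A\uFFFD\uFFFDA\uFFFDA"
for <41 C0 AF 41 F4 80 80 41>), while never looking ahead past the byte
that is currently being checked.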

I agree with Henri and others that there is no need at all to change the
recommendation in the standard that has been stable for so long (close
to 9 years).

Because the original was done on a PR
(http://www.unicode.org/review/pr-121.html), I think this should at
least also be handled as PR (if it's not dropped based on the discussion
here).

I think changing the current definition of "maximal subsequence" is a
bad idea, because it would mean that one wouldn't know what one was
speaking about over the years. If necessary, new definitions should be
introduced for other variants.

I agree with others that ICU should not be considered to have a special
status, it should be just one implementation among others.

[The next point is a side issue, please don't spend too much time on
it.] I find it particularly strange that at a time when UTF-8 is firmly
defined as up to 4 bytes, never including any bytes above 0xF4, the
Unicode consortium would want to consider recommending that <FD 81 82 83
84 85> be converted to a single U+FFFD. I note with agreement that
Markus seems to have thoughts in the same direction, because the
proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes
above F4 could be somewhat debatable.)".

Regards, Martin.
Received on Tue May 16 2017 - 06:09:22 CDT
