RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Shawn Steele via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 18:50:00 +0000

But why change a recommendation just because it “feels like” the right thing? As you said, it’s just a recommendation, so if that really annoyed someone, they could do something else (e.g., they could use a single FFFD).

If the recommendation is truly that meaningless or arbitrary, then we just get into silly discussions of “better” that nobody can really answer.

Alternatively, how about “one or more FFFDs” for the recommendation?

To me it feels very odd that a best practice would, in effect, require writing extra code just to detect an illegal case. The “best practice” here should maybe be “one or more FFFDs, whatever makes your code faster”.

Best practices may not be requirements, but people will still take the time to file bugs complaining that something isn’t following a “best practice”.

-Shawn

From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Markus Scherer via Unicode
Sent: Tuesday, May 16, 2017 11:37 AM
To: Alastair Houghton <alastair_at_alastairs-place.net>
Cc: Philippe Verdy <verdy_p_at_wanadoo.fr>; Henri Sivonen <hsivonen_at_hsivonen.fi>; unicode Unicode Discussion <unicode_at_unicode.org>; Hans Åberg <haberg-1_at_telia.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize <C0 AF> and <E0 9F 80> as single sequences.
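(To spell out the arithmetic for anyone who has not worked through it: <C0 AF> carries the payload bits 00000 + 101111 = U+002F, i.e. "/", whose shortest form is the single byte 2F; <E0 9F 80> carries 0000 + 011111 + 000000 = U+07C0, whose shortest form is the two bytes <DF 80>. Both decode to something under the old, looser reading; they are just not shortest-form, which is exactly what the tightened definition rules out.)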

I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago, where some implementations consumed valid single bytes or lead bytes as part of a preceding illegal sequence. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences.

I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating <C0 AF> and <E0 9F 80> as single sequences is "weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (à la Java string serialization, "modified UTF-8"), and for that it might also be convenient to treat overlong sequences as single errors.

If you don't like some recommendation, then do something else. It does not matter. If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary.
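To make that concrete, here is a rough sketch in plain C of one such policy -- copy well-formed sequences through, and write a single U+FFFD for each maximal run of bytes that cannot begin a well-formed sequence. This is not ICU's code, and the function names are made up; it is only meant to show that the replacement is never empty and that bytes belonging to following valid sequences are never consumed:

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Length (1..4) of the well-formed UTF-8 sequence starting at s, or 0 if
     * s does not start one.  Ranges follow the current, tightened definition:
     * no overlong forms, no surrogates, nothing above U+10FFFF. */
    static size_t utf8_valid_len(const unsigned char *s, size_t n) {
        unsigned char b0;
        if (n == 0) return 0;
        b0 = s[0];
        if (b0 <= 0x7F) return 1;
        if (b0 >= 0xC2 && b0 <= 0xDF)
            return (n >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
        if (b0 == 0xE0)  /* second byte A0..BF, else overlong */
            return (n >= 3 && s[1] >= 0xA0 && s[1] <= 0xBF
                    && (s[2] & 0xC0) == 0x80) ? 3 : 0;
        if ((b0 >= 0xE1 && b0 <= 0xEC) || b0 == 0xEE || b0 == 0xEF)
            return (n >= 3 && (s[1] & 0xC0) == 0x80
                    && (s[2] & 0xC0) == 0x80) ? 3 : 0;
        if (b0 == 0xED)  /* second byte 80..9F, else surrogate */
            return (n >= 3 && s[1] >= 0x80 && s[1] <= 0x9F
                    && (s[2] & 0xC0) == 0x80) ? 3 : 0;
        if (b0 == 0xF0)  /* second byte 90..BF, else overlong */
            return (n >= 4 && s[1] >= 0x90 && s[1] <= 0xBF
                    && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) ? 4 : 0;
        if (b0 >= 0xF1 && b0 <= 0xF3)
            return (n >= 4 && (s[1] & 0xC0) == 0x80
                    && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) ? 4 : 0;
        if (b0 == 0xF4)  /* second byte 80..8F, else above U+10FFFF */
            return (n >= 4 && s[1] >= 0x80 && s[1] <= 0x8F
                    && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) ? 4 : 0;
        return 0;
    }

    /* Copy well-formed sequences through; write a single U+FFFD (EF BF BD)
     * for each maximal run of bytes that cannot start a well-formed sequence.
     * The skip loop stops as soon as the next byte could start a well-formed
     * sequence, so valid bytes after an error are never consumed ("Constraints
     * on Conversion Processes").  out needs room for up to 3 * n bytes. */
    static size_t sanitize(const unsigned char *in, size_t n, unsigned char *out) {
        size_t i = 0, o = 0;
        while (i < n) {
            size_t len = utf8_valid_len(in + i, n - i);
            if (len > 0) {
                memcpy(out + o, in + i, len);
                o += len;
                i += len;
            } else {
                do { ++i; } while (i < n && utf8_valid_len(in + i, n - i) == 0);
                out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
            }
        }
        return o;
    }

    int main(void) {
        const unsigned char in[] = { 0x61, 0xC0, 0xAF, 0x62 };  /* 'a', overlong '/', 'b' */
        unsigned char out[3 * sizeof in];
        size_t n = sanitize(in, sizeof in, out);
        fwrite(out, 1, n, stdout);  /* prints 'a', U+FFFD, 'b' */
        putchar('\n');
        return 0;
    }

Whether you emit one U+FFFD per run, per byte, or per structural sequence is exactly the judgment call under discussion; the sketch just picks one.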

When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing a U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those covering the expected high-frequency code point ranges. Error handling can go into a slow path. In a true state table implementation, it will require more states but should not affect the performance of valid sequences.
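For illustration, whether it is written as extra rows in a state table or as a small helper on the error path, the extra work amounts to something like the sketch below (not ICU's actual code; the function name is made up). It only runs on ill-formed input, so the hot loop for valid text is untouched:

    /* Rough sketch: once the fast path has rejected the byte at s[0], decide
     * how many bytes to report as one error.  This version follows the length
     * implied by the lead byte and swallows up to that many trail bytes
     * (80..BF), so <C0 AF> and <E0 9F 80> each come back as a single
     * sequence, while a stray trail byte or F5..FF is a single-byte error and
     * a following valid byte is left for the next iteration. */
    static size_t error_span(const unsigned char *s, size_t n) {
        size_t want, i;
        if      (s[0] >= 0xC0 && s[0] <= 0xDF) want = 2;
        else if (s[0] >= 0xE0 && s[0] <= 0xEF) want = 3;
        else if (s[0] >= 0xF0 && s[0] <= 0xF4) want = 4;
        else                                   want = 1;
        for (i = 1; i < want && i < n && (s[i] & 0xC0) == 0x80; ++i) {
            /* consume trail bytes only */
        }
        return i;
    }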

Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting. In other words, this was not done out of convenience; it was an inconvenience that seemed justified by nicer error reporting. If you don't like to do so, then don't.

Which UTF is better? It depends. They all have advantages and problems. It's all Unicode, so it's all good.

ICU largely uses UTF-16 but also UTF-8. It has data structures and code for charset conversion, property lookup, sets of characters (UnicodeSet), and collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly growing set of APIs working directly with UTF-8.

So, please take a deep breath. No conformance requirement is being touched, no one is forced to do something they don't like, no special consideration is given for one UTF over another.

Best regards,
markus