Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Alastair Houghton via Unicode on 2017-05-15 (Unicode Mail List Archive)

From: Alastair Houghton via Unicode <unicode_at_unicode.org>
Date: Mon, 15 May 2017 19:02:34 +0100

On 15 May 2017, at 18:52, Asmus Freytag <asmusf_at_ix.netcom.com> wrote:
>
> On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
>> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode_at_unicode.org> wrote:
>>> In reference to:
>>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>>
>>> I think Unicode should not adopt the proposed change.
>> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense.
>
> Changing a specification as fundamental as this is something that should not be undertaken lightly.

Agreed.

> Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats.
>
> Even if it were true that all data is only stored in UTF-8, any data converted from UTF-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used.
>
> Implementations working in UTF-8 natively would potentially see three formats:
> 1) the original ill-formed data
> 2) data converted with single FFFD
> 3) data converted with multiple FFFD
>
> These forms cannot be compared for equality by binary matching.

But that was always true, even if you were under the impression that only one of (2) and (3) existed. Indeed, claiming equality between two instances of U+FFFD might itself be problematic in some circumstances (you don’t know why the U+FFFDs were inserted; they may not replace the same original data).
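
To make the three forms concrete, here is a rough Python sketch. The names decode_single_fffd and claimed_length are hypothetical helpers written for this illustration, and the "one U+FFFD per sequence implied by the lead byte" rule is only an approximation of the behaviour under discussion, not the wording of the proposal. CPython's built-in decoder with errors="replace" is used for the multiple-U+FFFD form.

```python
REPLACEMENT = "\ufffd"


def claimed_length(lead: int) -> int:
    """Sequence length implied by the lead byte's bit pattern alone."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1  # stray continuation byte, or 0xF8..0xFF


def decode_single_fffd(data: bytes) -> str:
    """Hypothetical decoder: one U+FFFD per structurally delimited bad sequence."""
    out = []
    i = 0
    while i < len(data):
        n = claimed_length(data[i])
        j = i + 1
        # Swallow the continuation bytes that the lead byte claims as its own.
        while j < len(data) and j < i + n and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict: rejects over-long forms
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


# Over-long three-byte encoding of U+0000 between two ASCII letters.
ill_formed = b"A\xe0\x80\x80B"

form1 = ill_formed                                                     # (1)
form2 = decode_single_fffd(ill_formed).encode("utf-8")                 # (2)
form3 = ill_formed.decode("utf-8", errors="replace").encode("utf-8")   # (3)

print(form1, form2, form3)
```

In this sketch form (2) carries one U+FFFD and form (3) carries three, so no two of the three byte strings match under binary comparison.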

> The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length.

It’s probably safer, actually, to refuse to compare U+FFFD as equal to anything (even itself) unless a special flag is passed. For “general purpose” applications, you could set that flag, and then a single U+FFFD would compare equal to another single U+FFFD. There is no need for the complicated “any string of U+FFFD” logic, which in any case makes little sense: it could just as easily generate erroneous comparisons as fix the case we’re worrying about here.
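
A minimal sketch of the two comparison strategies, in Python. The function names and the allow_replacement flag are made up for illustration; they are not from any existing library.

```python
import re

REPLACEMENT = "\ufffd"


def flagged_equal(a: str, b: str, *, allow_replacement: bool = False) -> bool:
    """Equality that treats U+FFFD as unequal to everything unless the flag is set.

    Without the flag, any string containing U+FFFD compares unequal,
    even to itself, because the replaced data is unknown.
    """
    if not allow_replacement and (REPLACEMENT in a or REPLACEMENT in b):
        return False
    return a == b


def run_insensitive_equal(a: str, b: str) -> bool:
    """The 'any run of U+FFFD equals any other run' rule, shown for contrast."""
    def collapse(s: str) -> str:
        return re.sub("\ufffd+", REPLACEMENT, s)
    return collapse(a) == collapse(b)


one = "A\ufffdB"          # e.g. a single bad sequence, single-U+FFFD converter
three = "A\ufffd\ufffd\ufffdB"  # the same bytes, or three unrelated errors

print(flagged_equal(one, one))                          # False
print(flagged_equal(one, one, allow_replacement=True))  # True
print(flagged_equal(one, three, allow_replacement=True))  # False
print(run_insensitive_equal(one, three))                # True
```

The last line shows the risk: the run-insensitive rule reports a match whether the three U+FFFDs came from one over-long sequence or from three separate, unrelated errors.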

> Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late.

I don’t think so. Even if we acknowledge the possibility of data in the other form, I think it’s useful guidance to implementers, both now and in the future. One might even imagine that the other, non-favoured, form would eventually fall out of use.

Kind regards,

Alastair.

--
http://alastairs-place.net

Received on Mon May 15 2017 - 13:02:46 CDT
