Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Doug Ewell via Unicode <unicode_at_unicode.org>
Date: Wed, 17 May 2017 13:37:51 -0700

Richard Wordingham wrote:

>> It is not at all clear what the intent of the encoder was - or even
>> if it's not just a problem with the data stream. E0 80 80 is not
>> permitted, it's garbage. An encoder can't "intend" it.
>
> It was once a legal way of encoding NUL, just like C0 80, which is
> still in use, and seems to be the best way of storing NUL as character
> content in a *C string*.

I wish I had a penny for every time I'd seen this urban legend.

At http://doc.cat-v.org/bell_labs/utf-8_history you can read the
original definition of UTF-8, from Ken Thompson on 1992-09-08, so long
ago that it was still called FSS-UTF:

"When there are multiple ways to encode a value, for example
UCS 0, only the shortest encoding is legal."

Unicode once permitted implementations to *decode* non-shortest forms,
but never allowed an implementation to *create* them
(http://www.unicode.org/versions/corrigendum1.html):

"For example, UTF-8 allows nonshortest code value sequences to be
interpreted: a UTF-8 conformant may map the code value sequence C0 80
(11000000₂ 10000000₂) to the Unicode value U+0000, even though a
UTF-8 conformant process shall never generate that code value sequence
-- it shall generate the sequence 00 (00000000₂) instead."

This was the passage that was deleted as part of Corrigendum #1.
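Since the thread's subject is how many U+FFFDs a decoder should emit for ill-formed input, a small illustration may help (Python assumed; not part of the original message). CPython's decoder follows the Unicode "maximal subpart" recommendation, replacing each longest invalid prefix with a single U+FFFD:

```python
# Illustrative only: counting the U+FFFDs CPython emits for two
# overlong encodings of U+0000 when decoding with errors="replace".
# C0 is never a valid lead byte, and E0 cannot be followed by 80,
# so each invalid byte ends up replaced individually here.
for seq in (b"\xc0\x80", b"\xe0\x80\x80"):
    decoded = seq.decode("utf-8", errors="replace")
    count = decoded.count("\ufffd")
    print(seq.hex(), "->", count, "x U+FFFD")
```

C0 80 yields two replacement characters and E0 80 80 yields three; other decoders that chose a different replacement policy could legitimately emit fewer, which is exactly what the proposal under discussion was about.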
 

--
Doug Ewell | Thornton, CO, US | ewellic.org
Received on Wed May 17 2017 - 15:39:09 CDT
