Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Hans Åberg via Unicode on 2017-05-17 (Unicode Mail List Archive)

From: Hans Åberg via Unicode <unicode_at_unicode.org>
Date: Wed, 17 May 2017 23:05:47 +0200

> On 17 May 2017, at 22:36, Doug Ewell via Unicode <unicode_at_unicode.org> wrote:
>
> Hans Åberg wrote:
>
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequence can be restored.
>
> I have always argued strongly against this idea, and always will.
>
> Far from solving the stated problem, it would introduce a new one:
> conversion from the "bad data" Unicode code points, currently
> well-defined, would become ambiguous.

Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. No Unicode extensions are needed, since Unicode has no say about what happens with what it considers invalid.
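
A minimal sketch of one way to read this, in Python. The names `INVALID_BASE`, `decode_lossless` and `encode_lossless` are hypothetical, and the choice of marker values (0x110000 plus the offending byte, i.e. above U+10FFFF) is an assumption made for illustration, not something fixed in this thread. It relies on Python's existing `surrogateescape` error handler (PEP 383) to find the undecodable bytes.

```python
# Assumption: each undecodable byte 0xNN is stored as the 32-bit value
# 0x110000 + 0xNN. That is above U+10FFFF, so it is not valid UTF-32 and
# cannot collide with any real Unicode code point.
INVALID_BASE = 0x110000


def decode_lossless(data: bytes) -> list:
    """Decode UTF-8 into 32-bit values, keeping invalid bytes as markers."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate
    # U+DCNN; well-formed UTF-8 never decodes to a surrogate in Python.
    text = data.decode("utf-8", errors="surrogateescape")
    units = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Escaped byte: move it out of the Unicode range entirely.
            units.append(INVALID_BASE + (cp - 0xDC00))
        else:
            units.append(cp)
    return units


def encode_lossless(units: list) -> bytes:
    """Inverse of decode_lossless: restore the original octet sequence."""
    chars = []
    for u in units:
        if u >= INVALID_BASE:
            # Marker value: turn it back into the escaping surrogate.
            chars.append(chr(0xDC00 + (u - INVALID_BASE)))
        else:
            chars.append(chr(u))
    return "".join(chars).encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\xfe.txt"  # valid UTF-8 mixed with stray bytes
    units = decode_lossless(raw)
    print([hex(u) for u in units])
    print(encode_lossless(units) == raw)
```

Valid text comes out as ordinary code points, and only the ill-formed octets end up outside the Unicode range, so the reverse conversion is unambiguous under these assumptions.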

> File systems cannot have it both ways: they must define file names
> either as unrestricted sequences of bytes, or as strings of characters
> in some defined encoding. If they choose the latter, they need to define
> conversion mechanisms with suitable fallback and adhere to them. They
> can use the PUA if they like.

The latter is complicated, so I am told that is not what one does, with some exceptions. Also, one may end up with a file in an unknown encoding, say one imported remotely, and then the OS cannot deal with it.
Received on Wed May 17 2017 - 16:06:09 CDT
