Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 from Hans Åberg via Unicode on 2017-05-18 (Unicode Mail List Archive)

From: Hans Åberg via Unicode <unicode_at_unicode.org>
Date: Thu, 18 May 2017 10:30:24 +0200

> On 16 May 2017, at 15:21, Richard Wordingham via Unicode <unicode_at_unicode.org> wrote:
>
> On Tue, 16 May 2017 14:44:44 +0200
> Hans Åberg via Unicode <unicode_at_unicode.org> wrote:
>
>>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode
>>> <unicode_at_unicode.org> wrote:
>> ...
>>> I think Unicode should not adopt the proposed change.
>>
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original octet
>> sequence can be restored.
>
> Escape sequences for the inappropriate bytes is the natural technique.
> Your problem is smoothly transitioning so that the escape character is
> always escaped when it means itself. Strictly, it can't be done.
>
> Of course, some sequences of escaped characters should be prohibited.
> Checking could be fiddly.

One could write the invalid bytes as \xnn escape codes, with each escape sequence terminated by \& as in Haskell, and translate a literal '\' into "\\". The result is then a C-style escaped string, not plain text.
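
A minimal sketch of that idea in Python, under stated assumptions: the names escape and unescape are made up for illustration, and each escape is a fixed two hex digits, so the Haskell-style \& terminator is not needed here (it would be if the hex escapes were variable length). It relies on Python's "surrogateescape" error handler to find the bytes that are not part of valid UTF-8.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def escape(data: bytes) -> str:
    # Valid UTF-8 decodes normally. Each invalid byte 0xNN (always >= 0x80)
    # comes back from surrogateescape as the lone surrogate U+DCNN, which
    # is then spelled out as \xNN. A literal backslash is doubled.
    out = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02x" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


def unescape(text: str) -> bytes:
    # Inverse of escape(): \\ gives one backslash byte, \xNN gives the
    # byte 0xNN, and everything else is encoded as ordinary UTF-8.
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1 : i + 2]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
        elif nxt == "\\":
            out.append(0x5C)
            i += 2
        elif nxt == "x":
            pair = text[i + 2 : i + 4]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError("malformed \\x escape at index %d" % i)
            out.append(int(pair, 16))
            i += 4
        else:
            raise ValueError("unknown escape at index %d" % i)
    return bytes(out)


# Round trip: valid text, a stray 0xFF byte, and a literal backslash.
raw = b"caf\xc3\xa9 \xff a\\b"
assert unescape(escape(raw)) == raw
```

The round trip from bytes to text and back is exact, but the reverse direction is not canonical: a hand-written string such as "\xc3\xa9" unescapes to the same bytes as "é", which looks like the kind of escaped sequence that would have to be prohibited or checked for.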
Received on Thu May 18 2017 - 03:30:59 CDT
