Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? from Doug Ewell via Unicode on 2017-07-24 (Unicode Mail List Archive)

From: Doug Ewell via Unicode <unicode_at_unicode.org>
Date: Mon, 24 Jul 2017 08:50:24 -0700

Costello, Roger L. wrote:

> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence.
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence?

Two things could go wrong:

1. (Bug) The folding process inserts CRLF plus white space characters,
and the unfolding process doesn't properly delete all of them.

2. (Non-conformant behavior) Some process, after folding and before
unfolding, attempts to interpret the partial UTF-8 sequences and
converts them into replacement characters or worse.
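
To make the two failure modes concrete, here is a rough Python sketch. It
assumes the fold marker is CRLF followed by a single space, and it uses "é"
(octets C3 A9) as the character that gets split; both are illustrative
choices, not taken from any particular implementation.

```python
# Assumption: a fold is CRLF followed by exactly one space.
FOLD = b"\r\n "

data = "é".encode("utf-8")            # b'\xc3\xa9', a two-octet sequence
folded = data[:1] + FOLD + data[1:]   # the fold lands between the two octets

# Case 1 (bug): the unfolder removes the CRLF but leaves the space behind,
# so a stray 0x20 now sits inside what used to be one sequence.
buggy = folded.replace(b"\r\n", b"")

# Case 2 (non-conformant): an intermediary decodes each piece on its own.
# Neither piece is valid UTF-8 by itself, so each becomes U+FFFD and the
# original octets are gone for good.
pieces = folded.split(FOLD)
mangled = "".join(p.decode("utf-8", errors="replace") for p in pieces)

print(buggy)          # b'\xc3 \xa9'
print(repr(mangled))  # '\ufffd\ufffd'
```

In both cases the damage comes from the surrounding process, not from
anything ambiguous in UTF-8 itself.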

In a minimally decent implementation, splitting and reassembling a UTF-8
sequence should always yield the correct result; there should be no
ambiguity.
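
A quick way to see why there is no ambiguity: every octet of a UTF-8
multi-octet sequence has its high bit set, so none of them can be mistaken
for CR, LF, or space. The sketch below folds at arbitrary octet positions and
unfolds again. The names fold and unfold are mine, the fold marker is assumed
to be CRLF plus one space, and the input is assumed to contain no raw CRLF of
its own.

```python
def fold(data: bytes, width: int) -> bytes:
    """Insert CRLF + space after every `width` octets, ignoring
    character boundaries entirely (hypothetical helper)."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return b"\r\n ".join(chunks)


def unfold(data: bytes) -> bytes:
    """Delete every CRLF + space pair, working on octets only and
    never trying to decode the pieces (hypothetical helper)."""
    return data.replace(b"\r\n ", b"")


text = "naïve café 日本語 🙂"
original = text.encode("utf-8")

# Try every fold width, so that 2-, 3- and 4-octet sequences all get
# split at every possible position.
for width in range(1, len(original) + 1):
    assert unfold(fold(original, width)) == original

print(unfold(fold(original, 1)).decode("utf-8"))  # naïve café 日本語 🙂
```

The important design point is that unfold never decodes anything; it only
removes the marker and lets the octets rejoin.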

A good implementation, of course, would know the character encoding of
the data, and would not split multi-byte sequences in that encoding to
begin with.
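
For what it's worth, avoiding the split is not hard when the encoding is
known to be UTF-8, because continuation octets always match the bit pattern
10xxxxxx. The following is only a sketch: fold_on_boundaries is a made-up
name, the input is assumed to be valid text, and the octet budget ignores the
leading space on continuation lines.

```python
def fold_on_boundaries(text: str, width: int) -> bytes:
    """Fold UTF-8 output into chunks of at most `width` octets without
    cutting a multi-octet sequence (hypothetical helper)."""
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + width, len(data))
        # If the cut would land on a continuation octet (10xxxxxx),
        # back up to the lead octet of that character.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # `width` is smaller than this one character; emit the whole
            # character anyway rather than split it.
            end = start + 1
            while end < len(data) and (data[end] & 0xC0) == 0x80:
                end += 1
        chunks.append(data[start:end])
        start = end
    return b"\r\n ".join(chunks)


folded = fold_on_boundaries("naïve café 日本語 🙂", 4)
# Every piece is valid UTF-8 on its own, so even a process that decodes
# the pieces separately sees no partial sequences.
for piece in folded.split(b"\r\n "):
    piece.decode("utf-8")  # strict decoding; would raise on a partial sequence
```

Note that this only protects encoded characters; it makes no attempt to keep
combining sequences or other multi-character units together.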

--
Doug Ewell | Thornton, CO, US | ewellic.org

Received on Mon Jul 24 2017 - 10:51:14 CDT
