RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

From: Costello, Roger L. via Unicode <unicode_at_unicode.org>
Date: Mon, 24 Jul 2017 17:57:43 +0000

Hi Folks,

Thank you very much for your fantastic comments!

Below I have summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files.

Some questions:
 
- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements?

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be longer than 75 octets:

        Lines of text SHOULD NOT be longer
        than 75 octets, excluding the line break.
 
The RFC says that long lines should be folded:

        Long content lines SHOULD be split
        into a multiple line representations
        using a line "folding" technique.
        That is, a long line can be split between
        any two characters by inserting a CRLF
        immediately followed by a single linear
        white-space character (i.e., SPACE or HTAB).
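
To make the folding rule concrete, here is a minimal Python sketch of a folder that never splits inside a UTF-8 multi-octet sequence. The function name fold_line and its limit parameter are my own for illustration, not from the RFC. It assumes the input is a single content line without its trailing CRLF, and it only respects code point boundaries (it does not try to keep combining sequences together).

```python
def fold_line(line: str, limit: int = 75) -> bytes:
    """Fold one content line so that no physical line exceeds `limit` octets.

    Hypothetical helper for illustration; the split always lands on a
    UTF-8 code point boundary.
    """
    data = line.encode("utf-8")
    chunks = []
    start = 0
    room = limit  # the first physical line may use the full limit
    while len(data) - start > room:
        end = start + room
        # Back up while `end` points at a continuation octet (0b10xxxxxx),
        # so the fold is never placed inside a multi-octet sequence.
        while end > start and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
        room = limit - 1  # continuation lines start with one SPACE
    chunks.append(data[start:])
    # A fold is CRLF immediately followed by a single SPACE.
    return b"\r\n ".join(chunks)
```

Note that the limit is counted in octets, not characters, which is why the sketch encodes to UTF-8 first and measures the byte string.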

The RFC says that, when parsing a content line, folded lines must first be unfolded:

        When parsing a content line, folded lines MUST
        first be unfolded.

using this technique:

        Unfolding is accomplished by removing the
        CRLF and the linear white-space character
        that immediately follows.
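
As a rough illustration of that unfolding rule, here is a short Python sketch that works on the raw octets, before any character decoding takes place. The name unfold is my own, and the sketch assumes the whole iCalendar stream is available as a bytes object.

```python
import re

# A fold marker is CRLF followed by exactly one SPACE or HTAB.
_FOLD = re.compile(rb"\r\n[ \t]")

def unfold(raw: bytes) -> bytes:
    """Remove every fold marker from the raw octets (hypothetical helper)."""
    return _FOLD.sub(b"", raw)
```

Only one white-space octet is removed per fold, so any further white space at the start of a continuation line is kept as content.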

The RFC acknowledges that some implementations might do folding in the middle of a multi-octet sequence:

        Note: It is possible for very simple
        implementations to generate improperly
        folded lines in the middle of a UTF-8
        multi-octet sequence. For this reason,
        implementations need to unfold lines
        in such a way to properly restore the
        original sequence.

Here is an example of folding in the middle of a UTF-8 multi-octet sequence:

The iCalendar file contains the Yen sign (U+00A5), which is represented in UTF-8 by the byte sequence 0xC2 0xA5. The content line containing the Yen sign is folded between those two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, which is no longer valid UTF-8.
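
The Yen sign example can be checked with a few lines of Python. This is only a sketch of the three outcomes (strict decoding, lossy decoding, and unfolding the octets first); the variable names are mine.

```python
# The five octets from the example: 0xC2, CR, LF, SPACE, 0xA5.
folded = bytes([0xC2, 0x0D, 0x0A, 0x20, 0xA5])

# Strict decoding before unfolding fails, because 0xC2 is not
# followed by a continuation octet.
try:
    folded.decode("utf-8")
    valid_before_unfolding = True
except UnicodeDecodeError:
    valid_before_unfolding = False

# Lossy decoding before unfolding turns each half of the sequence
# into U+FFFD REPLACEMENT CHARACTER, so the Yen sign is lost.
lossy = folded.decode("utf-8", errors="replace")

# Removing the fold from the octets first restores the original character.
restored = folded.replace(b"\r\n ", b"").decode("utf-8")
```

Here valid_before_unfolding is False, lossy contains two replacement characters around the CRLF and SPACE, and restored is the single character U+00A5.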

Proposed requirements on the behavior of applications that receive iCalendar files (each item describes a behavior that must not occur):

1. (Bug) The receiving application does not recognize that it has received an iCalendar file.

2. (Bug) The sending application folds the content lines (inserts CRLF followed by a white-space character), and the receiving application unfolds them but does not properly delete all of the inserted characters.

3. (Non-conformant behavior) The receiving application, before unfolding the folded lines, attempts to interpret the partial UTF-8 sequences and converts them into replacement characters or worse.

Received on Mon Jul 24 2017 - 12:58:18 CDT