Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? from Philippe Verdy via Unicode on 2017-07-24 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Mon, 24 Jul 2017 17:39:46 +0200

Also note that the maximum line-length in that RFC is a SHOULD and not a
MUST. This is intended to give a reasonable hint for the limit used in
implementations that process data in the given format: The RFC suggests a
maximum line length of 75 "characters", excluding the CRLF+SPACE
continuation sequence (not clear here what it means given that it refers to
UTF-8: should it be "code units", i.e. bytes?)

Due to this ambiguity, all implementations will need to interpret it as if
it meant 75 Unicode characters, each of which could take up to 4 bytes in
UTF-8, i.e. 300 bytes in total. Most implementations will use input buffers
for lines up to 512 bytes (including the CRLF+SPACE continuation), so it
will be simpler to insert the continuation just AFTER the line length
limit has been reached, without ever rolling back. And in all cases, there
should never be any CRLF+SPACE continuation sequence in the middle of a
UTF-8 sequence: that would break the UTF-8 condition which is assumed by
this RFC, i.e. it would break conformance to that RFC.
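
A minimal sketch of this reading, in Python, assuming the 75 counts Unicode
characters and the continuation is inserted right after the limit with no
rollback. The names fold_by_characters and unfold are hypothetical (they
are not from the RFC), and the input is assumed to be a single content line
already decoded to a string:

```python
def fold_by_characters(text: str, limit: int = 75) -> bytes:
    """Fold one content line every `limit` code points, then encode.

    Cutting a str can only happen between code points, so no UTF-8
    sequence is ever split by the CRLF+SPACE continuation.
    """
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)] or [""]
    return "\r\n ".join(chunks).encode("utf-8")


def unfold(data: bytes) -> bytes:
    """Remove each CRLF plus the single SPACE or HTAB that follows it."""
    return data.replace(b"\r\n ", b"").replace(b"\r\n\t", b"")
```

With this approach a folded line never exceeds 75 * 4 = 300 bytes, which
fits easily in a 512-byte input buffer.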

If an implementation thinks that 75 is a number of bytes, it is wrong.
Given the UTF-8 reference, it could still use that limit, but it should not
break in the middle of a UTF-8 sequence. It will still be safe for it to
break just after the sequence, even if the line (excluding the CRLF+SPACE
continuation sequence) is then up to 78 bytes long. Decoders will still be
able to parse it without breaking if they have the most common 512-byte
input buffer.
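
The byte-counting variant can be sketched the same way, again in Python
and again only as an illustration: fold_by_bytes is a hypothetical name,
and the input is assumed to be well-formed UTF-8. Once the limit is
reached, the fold is postponed until the next byte that starts a character
(anything that is not a 10xxxxxx trailing byte), so a line can overshoot 75
by at most 3 bytes:

```python
def fold_by_bytes(data: bytes, limit: int = 75) -> bytes:
    """Fold UTF-8 bytes at `limit`, but never inside a multi-byte sequence."""
    out = bytearray()
    line_len = 0  # bytes on the current line, continuation SPACE excluded
    for byte in data:
        # Trailing bytes look like 10xxxxxx; everything else starts a character.
        starts_character = (byte & 0xC0) != 0x80
        if line_len >= limit and starts_character:
            out += b"\r\n "
            line_len = 0
        out.append(byte)
        line_len += 1
    return bytes(out)
```

No rollback is needed here: the encoder only ever looks at the byte it is
about to write.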

2017-07-24 17:27 GMT+02:00 Philippe Verdy <verdy_p_at_wanadoo.fr>:

> But at the same time that RFC makes a direct reference to UTF-8 as being
> the default charset, so an implementation of the RFC cannot be agnostic to
> what UTF-8 is and will not break in the middle of a conforming UTF-8
> sequence.
>
> When the limit is reached, the implementation knows that it cannot cut
> at the position of a UTF-8 trailing byte, and knows that it can safely
> roll back at most 3 bytes to locate a conforming leading UTF-8 byte and
> split the line **before** it (or any 7-bit ASCII byte, to split the line
> just **after** it). This requires very little buffering and this is a
> fundamental property of UTF-8.
>
> Other character sets -- including /UTF-(16|32)([LB]E)?/ !!! -- are not
> directly supported, except by external decoders which would convert their
> input stream to UTF-8 (with all the same issues that may occur for such
> conversion when it is not roundtrip compatible or the input does not
> conform to the specification of the input charset, but this is not the
> problem of this RFC: these decoders may also roll back internally or attempt
> to guess another charset or will use substitution, but they are supposed to
> generate conforming UTF-8 on output).
>
>
> 2017-07-24 17:01 GMT+02:00 Steffen Nurpmeso via Unicode <
> unicode_at_unicode.org>:
>
>> "Costello, Roger L. via Unicode" <unicode_at_unicode.org> wrote:
>> |Suppose an application splits a UTF-8 multi-octet sequence. The
>> |application then sends the split sequence to a client. The client
>> |must restore the original sequence.
>> |
>> |Question: is it possible to split a UTF-8 multi-octet sequence in such
>> |a way that the client cannot unambiguously restore the original
>> |sequence?
>> |
>> |Here is the source of my question:
>> |
>> |The iCalendar specification [RFC 5545] says that long lines must be
>> |folded:
>> |
>> | Long content lines SHOULD be split
>> | into a multiple line representations
>> | using a line "folding" technique.
>> | That is, a long line can be split between
>> | any two characters by inserting a CRLF
>> | immediately followed by a single linear
>> | white-space character (i.e., SPACE or HTAB).
>> |
>> |The RFC says that, when parsing a content line, folded lines must first
>> |be unfolded using this technique:
>> |
>> | Unfolding is accomplished by removing
>> | the CRLF and the linear white-space
>> | character that immediately follows.
>> |
>> |The RFC acknowledges that simple implementations might generate
>> |improperly folded lines:
>> |
>> | Note: It is possible for very simple
>> | implementations to generate improperly
>> | folded lines in the middle of a UTF-8
>> | multi-octet sequence. For this reason,
>> | implementations need to unfold lines
>> | in such a way to properly restore the
>> | original sequence.
>>
>> That is not what the RFC says. It says that simple
>> implementations simply split lines when the limit is reached,
>> which might be in the middle of an UTF-8 sequence. The RFC is
>> thus improved compared to other RFCs in the email standard
>> section, which do not give any hints on how to do that. Even
>> RFC 2231, which avoids many of the ambiguities and problems of RFC
>> 2047 (for a different purpose, but still), does not say it so
>> exactly for the reversing character set conversion (which i for
>> one perform _once_ after joining together the chunks, but is not
>> a written word and, thus, ...).
>>
>> --steffen
>> |
>> |Der Kragenbaer, The moon bear,
>> |der holt sich munter he cheerfully and one by one
>> |einen nach dem anderen runter wa.ks himself off
>> |(By Robert Gernhardt)
>>
>
>
Received on Mon Jul 24 2017 - 10:40:31 CDT
