Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

From: Philippe Verdy via Unicode <>
Date: Mon, 24 Jul 2017 22:50:05 +0200

2017-07-24 21:12 GMT+02:00 J Decker via Unicode <>:

> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <>
> wrote:
>> Hi Folks,
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
> The RFC doesn't say 'characters' but either a space or a tab character
> (singular)
> back scanning is simple enough
> while( ( from[0] & 0xC0 ) == 0x80 )
> from--;

Certainly not like this! Backscanning should only use a single assignment
from the last known start position, with no loop at all! UTF-8 security is
based on the fact that its sequences are strictly limited in length, so you
will never have more than 3 trailing bytes.

If you don't have that last position in a variable, just use 3 tests but NO
loop at all: if you are still on a trailing byte after the 3 tests, you know
the input was not valid at all. That error is not handled by a very insecure
unbounded loop like the one above, but by exiting and returning an error
immediately, or by throwing an exception.

The code should rather be:

    // == binds tighter than &, so the inner parentheses are required
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with the character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)

And it should be secured with a guard byte placed just before the start of
the buffer into which the "from" pointer points, so that the backscan never
reads anything outside the buffer and can generate an error instead.
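
As a rough sketch only, the bounded backscan could look like this in plain C.
The function name utf8_seq_start and the explicit "buf" start pointer are my
own assumptions for illustration: the pointer comparison stands in for the
guard byte, and a NULL return stands in for the exception. It only finds the
start of the sequence; it does not check that the leading byte announces the
right number of trailing bytes.

```c
#include <stddef.h>

/* Hypothetical helper, not from any library: return the first byte of the
 * UTF-8 sequence that contains *pos, or NULL if the input is invalid.
 * buf is the first valid byte of the buffer; pos must satisfy pos >= buf. */
static const unsigned char *utf8_seq_start(const unsigned char *buf,
                                           const unsigned char *pos)
{
    /* Three tests, no loop: a sequence has at most 3 trailing bytes.
     * The pos > buf check keeps the scan inside the buffer. */
    if ((*pos & 0xC0) == 0x80 && pos > buf) pos--;
    if ((*pos & 0xC0) == 0x80 && pos > buf) pos--;
    if ((*pos & 0xC0) == 0x80 && pos > buf) pos--;

    /* Still on a trailing byte (10xxxxxx): no leading byte within reach. */
    if ((*pos & 0xC0) == 0x80)
        return NULL;

    return pos; /* an ASCII byte or a UTF-8 leading byte */
}
```

The caller tests the result for NULL and rejects the input immediately
rather than scanning further back.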
Received on Mon Jul 24 2017 - 15:50:46 CDT
