Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Mon, 24 Jul 2017 23:03:50 +0200

2017-07-24 22:50 GMT+02:00 Philippe Verdy <verdy_p_at_wanadoo.fr>:

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode_at_unicode.org>:
>
>>
>>
>> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
>> unicode_at_unicode.org> wrote:
>>
>>> Hi Folks,
>>>
>>> 2. (Bug) The sending application performs the folding process - inserts
>>> CRLF plus white space characters - and the receiving application does the
>>> unfolding process but doesn't properly delete all of them.
>>>
>> The RFC doesn't say 'characters' but either a space or a tab character
>> (singular)
>>
>> back scanning is simple enough
>>
>> while( ( from[0] & 0xC0 ) == 0x80 )
>>     from--;
>>
>
> Certainly not like this! Backscanning should only use a single direct
> assignment to the last known start position, with no loop at all! UTF-8
> security rests on the fact that its sequences are strictly limited in
> length, so you will never have more than 3 trailing bytes.
>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests fail, you know the input was not valid,
> and the right way to handle this error is not a very insecure unbounded
> loop like the one above, but to exit and return an error immediately, or
> to throw an exception.
>
> The code should rather be:
>
> if ((from[0] & 0xC0) == 0x80) from--;
> else if ((from[-1] & 0xC0) == 0x80) from -= 2;
> else if ((from[-2] & 0xC0) == 0x80) from -= 3;
> if ((from[0] & 0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)
>
Sorry, sent too fast, I should not have copy-pasted lines trying to adapt
your loop; the correct code uses no "else" at all:

> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)
>
>
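For reference, here is a minimal C sketch of the same three-test backscan, under a few assumptions that are not in the messages above: the data sits in a buffer of known length, positions are indices rather than raw pointers, and an error is reported through a sentinel return value instead of an exception. The names utf8_seq_start and UTF8_INVALID are hypothetical. Note that in C the mask must be parenthesized, because == binds tighter than &.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel: no valid sequence start was found. */
#define UTF8_INVALID ((size_t)-1)

/* Hypothetical helper: return the index of the first byte of the UTF-8
 * sequence that contains buf[pos], or UTF8_INVALID if more than 3
 * trailing bytes (10xxxxxx) precede it or the buffer start is reached. */
static size_t utf8_seq_start(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos < len);

    /* Three unrolled steps, no loop: a valid sequence has at most
     * 3 trailing bytes. The pos > 0 guard keeps us inside the buffer. */
    if (pos > 0 && (buf[pos] & 0xC0) == 0x80) pos--;
    if (pos > 0 && (buf[pos] & 0xC0) == 0x80) pos--;
    if (pos > 0 && (buf[pos] & 0xC0) == 0x80) pos--;

    /* Still on a trailing byte: the input is not valid UTF-8. */
    if ((buf[pos] & 0xC0) == 0x80)
        return UTF8_INVALID;

    return pos; /* ASCII byte or UTF-8 leading byte */
}
```

This only locates a candidate start byte. It does not check that the leading byte announces the right number of trailing bytes, so a full decoder would still need to validate the sequence from that position.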
Received on Mon Jul 24 2017 - 16:04:29 CDT

This archive was generated by hypermail 2.2.0 : Mon Jul 24 2017 - 16:04:29 CDT