Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

From: J Decker via Unicode <unicode_at_unicode.org>
Date: Mon, 24 Jul 2017 14:23:12 -0700

On Mon, Jul 24, 2017 at 1:50 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode_at_unicode.org>:
>
>>
>>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests fail, you know the input was not valid at
> all. The way to handle that error is not a very insecure unbounded loop
> like the one above, but exiting and returning an error immediately, or
> throwing an exception.
>
> A better version of the code:
>
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) throw (some exception);
> // continue here with the character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 lead byte)
>
>
I generally accepted any UTF-8 encoding up to 31 bits, though (since I was
going from the original spec, not the effective limit based on the Unicode
code point space). The while loop is more terse but less optimal because of
the pipeline flush caused by the backward jump, so yes, the series of ifs is
much better :) (The original code also knows the start of the string, and
strings are effectively prefixed with a 0 byte anyway because of a long
little-endian size.)

And you'd probably be tracking an output offset as well, so it becomes a
little longer than the above.

> And it should be secured using a guard byte at the start of the buffer
> into which the "from" pointer points, so that the scan will never read
> anything outside it and can report an error.
>
>
Received on Mon Jul 24 2017 - 16:23:34 CDT

This archive was generated by hypermail 2.2.0 : Mon Jul 24 2017 - 16:23:35 CDT