Re: What to backup after corruption of code units? from Karl Williamson on 2013-08-28 (Unicode Mail List Archive)

From: Karl Williamson <public_at_khwilliamson.com>
Date: Wed, 28 Aug 2013 19:25:07 -0600

On 08/28/2013 06:52 PM, Asmus Freytag wrote:
> On 8/28/2013 5:19 PM, Doug Ewell wrote:
>> Actually 0xC2, according to the rules of UTF-8.
>
> Hmm. What you are referring to is that 0xC0 and 0xC1 don't occur because
> of the requirement for minimal length encoding. However, a check for
> >=0xC0 will give the correct result for backing up, assuming the data
> is valid UTF-8 (or at least locally valid).
>
> In terms of boundary determination, would you take violating the rule
> about minimal length encoding as evidence for corrupted data, or would
> you first detect the boundary, then decide that a sequence starting with
> 0xC0 is in violation?
>
> A./

I have code that does the latter: it finds the boundary first and only
then decides whether the sequence is an overlong. It works well. But my
view may be colored by a design constraint: the parsing subroutine has to
accept overlongs if its caller sets a flag to allow them.
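
A minimal sketch of that ordering, in Python and not the code referred to
above. The names back_up_to_start, decode_at, and allow_overlong are
hypothetical, and pos is assumed to be a valid index into the buffer. The
sketch only illustrates "boundary first, overlong check second"; it leaves
out the surrogate and above-0x10FFFF checks that a full validator would
also need.

```python
# Hypothetical helper names; an illustrative sketch only.
# Smallest code point that each sequence length may legally encode.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000}


def back_up_to_start(buf: bytes, pos: int) -> int:
    """Step 1: move left past continuation bytes (10xxxxxx) to a boundary.

    Any byte < 0x80 or >= 0xC0 counts as a start byte here, including
    0xC0 and 0xC1; overlongs are not considered at this stage.
    """
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def decode_at(buf: bytes, start: int, allow_overlong: bool = False):
    """Step 2: decode one sequence at a boundary, then judge overlongs.

    Returns (code_point, length) or raises ValueError.
    """
    lead = buf[start]
    if lead < 0x80:
        return lead, 1
    if lead < 0xC0:
        raise ValueError("continuation byte at sequence start")
    if lead < 0xE0:
        length, cp = 2, lead & 0x1F
    elif lead < 0xF0:
        length, cp = 3, lead & 0x0F
    elif lead < 0xF8:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")

    tail = buf[start + 1:start + length]
    if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("truncated or malformed sequence")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)

    # Only now, with the boundary and length known, is the overlong
    # rule applied, and only if the caller has not opted out of it.
    if cp < MIN_CODE_POINT[length] and not allow_overlong:
        raise ValueError("overlong encoding")
    return cp, length
```

With the bytes C0 AF, back_up_to_start(b"\xc0\xaf", 1) returns 0, and
decode_at at that position raises ValueError by default but returns
(0x2F, 2) when allow_overlong=True.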
Received on Wed Aug 28 2013 - 20:27:18 CDT
