Re: What to backup after corruption of code units?

From: Asmus Freytag <>
Date: Wed, 28 Aug 2013 18:44:11 -0700

On 8/28/2013 6:25 PM, Karl Williamson wrote:
> On 08/28/2013 06:52 PM, Asmus Freytag wrote:
>> On 8/28/2013 5:19 PM, Doug Ewell wrote:
>>> Actually 0xC2, according to the rules of UTF-8.
>> Hmm. What you are referring to is that 0xC0 and 0xC1 don't occur because
>> of the requirement for minimal length encoding. However, a check for
>> >=0xC0 will give the correct result for backing up, assuming the data
>> is valid UTF-8 (or at least locally valid).
>> In terms of boundary determination, would you take violating the rule
>> about minimal length encoding as evidence for corrupted data, or would
>> you first detect the boundary, then decide that a sequence starting with
>> 0xC0 is in violation?
>> A./
> I have code that does the latter, and it works well. But this may be
> colored by the fact that there was a design constraint to accept
> overlongs if the caller of the parsing subroutine sets a flag to allow
> them.
OK, so the point of my questions was to determine whether it would make
sense (or even be imperative) to change the text passage in question,
which dates from Unicode 4.0.0 if not earlier.

Given your (Karl's) statement, I'm comfortable suggesting that we leave
it as is.
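
For anyone following along, here is a rough sketch of the "find the
boundary first, then judge the lead byte" order discussed above. It is
not Karl's code and not text from the standard; the function names and
the allow_overlong flag are made up for illustration, and it only looks
at the two-byte overlong leads 0xC0 and 0xC1, not at the overlong forms
that start with 0xE0 or 0xF0.

```python
def is_continuation(b: int) -> bool:
    # Trailing bytes of a UTF-8 sequence have the bit pattern 10xxxxxx.
    return 0x80 <= b <= 0xBF


def back_up(buf: bytes, i: int) -> int:
    """Step 1: return the index where the sequence containing buf[i] starts.

    Skips at most three trailing bytes, the maximum in a well-formed
    sequence. For non-ASCII bytes, stopping at the first non-continuation
    byte is the same as stopping at the first byte >= 0xC0.
    """
    start = i
    steps = 0
    while start > 0 and steps < 3 and is_continuation(buf[start]):
        start -= 1
        steps += 1
    return start


def lead_is_acceptable(buf: bytes, start: int, allow_overlong: bool = False) -> bool:
    """Step 2: only after the boundary is known, decide about the lead byte."""
    b = buf[start]
    if b < 0x80:
        return True            # ASCII
    if b < 0xC0:
        return False           # stray trailing byte, no lead found
    if b in (0xC0, 0xC1):
        return allow_overlong  # can only begin an overlong two-byte form
    return b <= 0xF4           # 0xF5..0xFF never occur in UTF-8


data = b"a\xc3\xa9b"           # "a", U+00E9, "b"
start = back_up(data, 2)       # index 2 is the trailing byte 0xA9
print(start, lead_is_acceptable(data, start))
```

With this split, a sequence starting with 0xC0 is still found as a
boundary by back_up; whether it is then rejected depends only on the
flag passed to the second step.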

Received on Wed Aug 28 2013 - 20:46:10 CDT
