Re: What to backup after corruption of code units? from Bill Poser on 2013-08-27 (Unicode Mail List Archive)

From: Bill Poser <billposer2_at_gmail.com>
Date: Tue, 27 Aug 2013 21:06:12 -0700

"Backup" in this context refers to moving back over the preceding bytes in
order to find the boundary between the previous, valid character and the
corrupted character that you have encountered. In other words, if you have
a string consisting of N bytes and at byte K you determine that the current
sequence of bytes is not a valid UTF-8 encoding, you can conclude that the
bytes up to K-J are a valid sequence and that the corruption starts after
that, where J is 1, 2, or 3.
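
To make the idea concrete, here is a minimal sketch in Python of what
"backing up" could look like for UTF-8. The function name utf8_char_start
is made up for illustration, and the sketch assumes the only thing you
want is the index of the nearest preceding byte that is not a continuation
byte (10xxxxxx); it does not validate the sequence it finds.

```python
def utf8_char_start(buf: bytes, k: int) -> int:
    """Back up from index k to the nearest byte that is not a UTF-8
    continuation byte, taking at most three steps.

    Hypothetical helper for illustration only; it does not check that
    the bytes starting at the returned index form a valid character.
    """
    start = k
    steps = 0
    # Continuation bytes have the bit pattern 10xxxxxx, i.e. the top two
    # bits are 1 and 0. A well-formed character has at most three of them.
    while start > 0 and steps < 3 and (buf[start] & 0xC0) == 0x80:
        start -= 1
        steps += 1
    # If buf[start] is still a continuation byte here, the data is
    # malformed at this point; the caller has to decide what to do.
    return start


data = "a\u20acb".encode("utf-8")   # bytes: 61 E2 82 AC 62
print(utf8_char_start(data, 3))     # index of the lead byte E2
```

Here index 3 points at the last byte of the three-byte euro sign, so the
loop backs up twice and stops at the lead byte at index 1.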

On Tue, Aug 27, 2013 at 6:36 PM, Xue Fuqiao <xfq.free_at_gmail.com> wrote:

> Hi list,
>
> I'm reading Unicode 6.2.0 and have a question. In Section 2.5, Encoding
> Forms:
>
> For example, when randomly accessing a string, a program can find the
> boundary of a character with limited backup. In UTF-16, if a pointer
> points to a leading surrogate, a single backup is required. In UTF-8,
> if a pointer points to a byte starting with 10xxxxxx (in binary), one
> to three backups are required to find the beginning of the character.
>
> What does the "backup" mean here? What does the program backup?
>
> I searched "backup" with unicode.org/search/ but didn't get anything
> that looked promising. Can anyone point me in the right direction?
>
> (English is not my native language; please excuse typing errors.)
>
> --
> Best regards, Xue Fuqiao.
>
>
Received on Tue Aug 27 2013 - 23:08:09 CDT
