RE: What to backup after corruption of code units? from Phillips, Addison on 2013-08-27 (Unicode Mail List Archive)

From: Phillips, Addison <addison_at_lab126.com>
Date: Wed, 28 Aug 2013 04:15:01 +0000

"Back up" here refers to decrementing the pointer in the string.

If you have a string consisting of the following UTF-16 code units, for example:

00C0 0020 20AC D800 DC00 00C5
0    1    2    3    4    5

If you set the pointer to code unit number 4 (counting from 0), it points at "DC00", which is a trailing ("low") surrogate. The pointer needs to "back up" (decrement) by one to position 3 (0xD800), the leading ("high") surrogate, to find the start of the character (here U+10000). Each of the other code units represents a single code point by itself.
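
As a rough illustration of that "back up" step, here is a minimal Python sketch. The function names (utf16_char_start, utf8_char_start) are made up for this example and are not from the Unicode Standard or any library; the UTF-8 variant is included only because the quoted passage mentions it, and it assumes well-formed input.

```python
def utf16_char_start(units, i):
    """Return the index where the character containing units[i] begins.

    `units` is a sequence of UTF-16 code units given as integers.
    """
    # A trailing (low) surrogate lies in DC00..DFFF. Back up one unit only
    # when the previous unit is a leading (high) surrogate, D800..DBFF.
    if 0xDC00 <= units[i] <= 0xDFFF and i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
        return i - 1
    return i


def utf8_char_start(data, i):
    """Return the index of the lead byte of the character containing data[i].

    `data` is a bytes object holding UTF-8.
    """
    # Continuation bytes have the form 10xxxxxx; back up at most three times.
    steps = 0
    while i > 0 and steps < 3 and (data[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    return i


# The string from this message: code unit 4 is a trailing surrogate.
units = [0x00C0, 0x0020, 0x20AC, 0xD800, 0xDC00, 0x00C5]
print(utf16_char_start(units, 4))  # 3
print(utf16_char_start(units, 2))  # 2 (already at a character boundary)
```

Note that in both encodings the number of decrements is bounded (one for UTF-16, three for UTF-8), which is what the standard means by "limited backup".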

Addison Phillips
Globalization Architect (Amazon Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On
> Behalf Of Xue Fuqiao
> Sent: Tuesday, August 27, 2013 6:37 PM
> To: unicode_at_unicode.org
> Subject: What to backup after corruption of code units?
>
> Hi list,
>
> I'm reading Unicode 6.2.0 and have a question. In Section 2.5, Encoding Forms:
>
> For example, when randomly accessing a string, a program can find the
> boundary of a character with limited backup. In UTF-16, if a pointer
> points to a leading surrogate, a single backup is required. In UTF-8,
> if a pointer points to a byte starting with 10xxxxxx (in binary), one
> to three backups are required to find the beginning of the character.
>
> What does the "backup" mean here? What does the program backup?
>
> I searched "backup" with unicode.org/search/ but didn't get anything that
> looked promising. Can anyone point me in the right direction?
>
> (English is not my native language; please excuse typing errors.)
>
> --
> Best regards, Xue Fuqiao.
Received on Tue Aug 27 2013 - 23:43:05 CDT
