Re: What to backup after corruption of code units?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 28 Aug 2013 07:44:36 +0200

The term is probably badly chosen but it means that you must read backward
from the start position.
The term "backup" is not related to any data copying/saving operation.

- in UTF-16 there's an error in your citation: if you find a leading
surrogate (in 0xD800..0xDBFF), you are already at the correct position and
only need to read the next code unit (which should then be the trailing
surrogate in 0xDC00..0xDFFF). It is when you find a trailing surrogate that
you need to back up one position, where the leading surrogate should be.

- in UTF-8, if you find a trailing byte (0x80..0xBF), you'll need to look
backward from 1 to 3 positions before your start position to find the
leading 8-bit code unit (>= 0xC0).

In both cases you have to check the value found. If you don't find it
within that limited range of positions, the input is not valid UTF-8 or
UTF-16 and you have to handle an encoding error in the input stream.
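
As a rough illustration of the two cases above, here is a minimal Python
sketch. The function names utf16_char_start and utf8_char_start are my own
(hypothetical), the UTF-16 input is assumed to be a list of integer code
units and the UTF-8 input a bytes object. It only locates the start of the
character and does not fully validate the sequence (for example it does not
reject the invalid lead bytes 0xC0, 0xC1 or 0xF5 and above).

```python
def utf16_char_start(units, i):
    """Index of the first code unit of the character covering units[i]."""
    if 0xDC00 <= units[i] <= 0xDFFF:
        # Trailing surrogate: back up exactly one code unit.
        if i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
            return i - 1
        raise ValueError("unpaired trailing surrogate at index %d" % i)
    # BMP code unit or leading surrogate: already at the start.
    # (The caller still has to check that a trailing surrogate follows.)
    return i


def utf8_char_start(data, i):
    """Index of the first byte of the character covering data[i]."""
    if data[i] & 0xC0 != 0x80:
        return i  # not a trailing byte (10xxxxxx): nothing to back up
    for back in range(1, 4):  # look backward 1 to 3 positions
        j = i - back
        if j < 0:
            break
        b = data[j]
        if b & 0xC0 == 0x80:
            continue  # still a trailing byte, keep looking backward
        if b >= 0xC0:
            # Lead byte: check that its declared length reaches position i.
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            if back < length:
                return j
        break  # ASCII byte, or a lead byte that is too short
    raise ValueError("no valid lead byte for position %d" % i)
```

For example, with data = "a€".encode("utf-8") (bytes 61 E2 82 AC),
utf8_char_start(data, 3) backs up two positions and returns 1.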

The Unicode standard does not specify how you'll handle this error situation
or from where you'll be able to resync the stream, or even if you should
resync from some further position; this is application-dependent. If the
input stream is live (for example coming from broadcast media), you'll
probably want to just skip the error, invalidate the current state, signal
an error to the user in some way, and then try to restart from the next
valid position. But data truncation will occur and there's no easy way to
determine if your text stream will parse correctly.
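
One possible way to do such a "skip and restart" in Python is sketched
below. This is only one policy among many, not something the standard
mandates: the function name decode_resync is hypothetical, each undecodable
span is replaced by U+FFFD, and the byte offset of each error is recorded so
the application can signal it. It relies on the start and end attributes of
Python's UnicodeDecodeError, and it re-slices the input after each error, so
it is not meant for inputs with very many errors.

```python
def decode_resync(data):
    """Decode UTF-8 bytes, skipping invalid spans.

    Returns (text, error_offsets); each invalid span becomes U+FFFD.
    """
    out = []
    error_offsets = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break  # the rest of the input was valid
        except UnicodeDecodeError as e:
            # e.start / e.end are relative to the slice data[pos:].
            out.append(data[pos:pos + e.start].decode("utf-8"))
            error_offsets.append(pos + e.start)
            out.append("\ufffd")  # mark the truncated data
            pos += e.end          # restart after the invalid span
    return "".join(out), error_offsets
```

If you don't need the error offsets, the built-in
data.decode("utf-8", errors="replace") gives a similar substitution.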

If the input text stream is a script with its own syntax, the script will
not process correctly and its interpretation or compilation should be
stopped with an exception thrown or error status returned to the client API.

But if the stream is just some readable text (e.g. subtitle text displayed
on a video), the user will just see a part of the text, but the video will
continue playing. If the input text is an ongoing chat discussion, some of
the discussion will be truncated but the discussion will continue from
there. If the input text is from a file or from a data structure supposed
to contain the full text, the file or data structure is corrupted.
Depending on the case this could be an internal software bug, a reliability
problem in the storage, or an error in the transmission medium or network.
This could as well be an input stream that was actually not encoded with
this UTF (you may retry by guessing which text encoding was used, not
necessarily a UTF).
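
A very naive form of that retry can be sketched as follows. The function
name guess_decode and the candidate list are purely illustrative
assumptions; real encoding detection needs much more than this, and a
single-byte encoding such as latin-1 accepts any byte sequence, so it only
makes sense as the last candidate.

```python
def guess_decode(data, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings matched")
```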

2013/8/28 Xue Fuqiao <xfq.free_at_gmail.com>

> Hi list,
>
> I'm reading Unicode 6.2.0 and have a question. In Section 2.5, Encoding
> Forms:
>
> For example, when randomly accessing a string, a program can find the
> boundary of a character with limited backup. In UTF-16, if a pointer
> points to a leading surrogate, a single backup is required. In UTF-8,
> if a pointer points to a byte starting with 10xxxxxx (in binary), one
> to three backups are required to find the beginning of the character.
>
> What does the "backup" mean here? What does the program backup?
>
> I searched "backup" with unicode.org/search/ but didn't get anything
> that looked promising. Can anyone point me in the right direction?
>
> (English is not my native language; please excuse typing errors.)
>
> --
> Best regards, Xue Fuqiao.
>
>
Received on Wed Aug 28 2013 - 00:47:09 CDT
