(M) Scanning UTF-8 backwards is possible?

From: Marco Mussini (marco.mussini@vim.tlt.alcatel.it)
Date: Mon Aug 31 1998 - 04:29:57 EDT


Hi,

Some people here told me that thanks to the particular structure of the
UTF-8 encoding, you can look at any byte and immediately know where you
are.

A single-byte sequence_that_represents_a_character_in_UTF8 always has
the most significant bit set to zero. This makes it perfectly compatible
with, and indistinguishable from, 7-bit ASCII when it is encoding
"regular" US ASCII data.
In a longer sequence, the first byte has the N most significant bits
set to 1, followed by a 0, where N is the total number of bytes in the
sequence_that_represents_a_character_in_UTF8. Every byte after the
first has its two most significant bits set to 10.

For example, if we have a two-byte sequence to represent a character, we
will have the bits as follows:

110xxxxx 10xxxxxx

Three-byte sequence:

1110xxxx 10xxxxxx 10xxxxxx

Four-byte sequence:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Single-byte character:

0xxxxxxx

So if you look at a byte you can immediately tell where you are.

Going backwards one character simply requires stepping back until you
reach a byte whose two most significant bits are not 10.
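
For what it's worth, here is a minimal Python sketch of that backward
step. It assumes the standard UTF-8 layout, in which every byte after
the first in a sequence has the form 10xxxxxx, and it assumes the buffer
is well-formed. The function names are made up for illustration.

```python
def is_continuation(byte: int) -> bool:
    # Assumption: continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def prev_char_start(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the character ending just before pos."""
    if pos <= 0:
        raise ValueError("no character before position 0")
    i = pos - 1
    # Step back over continuation bytes until a lead byte or an ASCII byte is found.
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def chars_backwards(buf: bytes):
    """Yield the characters of a well-formed UTF-8 buffer from last to first."""
    pos = len(buf)
    while pos > 0:
        start = prev_char_start(buf, pos)
        yield buf[start:pos].decode("utf-8")
        pos = start
```

Calling list(chars_backwards(data)) on an encoded string should give its
characters in reverse order, without ever scanning from the start of the
buffer.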

Can you confirm this?

I read someone on this list claiming that UTF-8x (note the "x") cannot
be scanned backwards unless it is rewound to the start. What is UTF-8x,
and why is it not backwards scannable?

--M


