RE: Do 16 bit surrogate high bits indicating characters have a persisting meaning please?

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 22 2001 - 08:33:37 EST


William Overington imagined:
> When thinking about using surrogate pairs of 16 bit unicode
> characters to express a 21 bit unicode character I like to
> think in terms of an analogy of a Medieval Great Field
> divided into strips for cultivation.

That's what freedom of thought is for: allowing people to think in whatever
terms they prefer. :-)

William Overington asked:
> Suppose that one has a document [...] that consists of a
> sequence of unicode characters that are each more than
> 16 bits [...] all of the characters are located in the
> same strip of the great field. Suppose that there are n
> characters [...]
> Would the sequence of sixteen bit characters contain
> 2n or n+1 characters or some other number?

You always restart from the king's manor and walk down the central street
each time.

That is, each character is encoded with its own high and low surrogate, even
if the high surrogate happens to be the same for all of the characters. So you
need 2n "code units" (not "characters") to encode n characters.

As you said, both approaches have their advantages and disadvantages.

The method that you suggest (which would be called a "shifted" encoding, and
is actually used in some Far East double-byte encodings) is clearly more
economical in terms of memory usage, but it is very vulnerable. The weak point
is the single high surrogate code unit which determines the interpretation of
a whole sequence of low surrogate code units. You can imagine what happens if
*that* very code unit gets corrupted! Your whole novel could become garbage,
because the high bits of every character would be wrong.
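
Purely as a hedged illustration of that weak point, here is a toy Python
decoder for such a hypothetical shifted scheme (nothing of the sort exists in
Unicode; all names here are invented). The single shared high surrogate fixes
the high bits of every character, so corrupting that one code unit moves the
whole text into a different strip of the field:

def decode_shifted(units):
    # units = [high, low1, low2, ...]: one shared high surrogate,
    # then one low surrogate per character (n + 1 code units for n chars).
    high, lows = units[0], units[1:]
    base = 0x10000 + ((high - 0xD800) << 10)
    return [base + (low - 0xDC00) for low in lows]

stream    = [0xD801, 0xDC00, 0xDC01, 0xDC02]       # three characters
corrupted = [0xD802] + stream[1:]                  # flip only the shared unit
print([hex(c) for c in decode_shifted(stream)])    # U+10400, U+10401, U+10402
print([hex(c) for c in decode_shifted(corrupted)]) # every character has moved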

On the other hand, Unicode's method (which is called UTF-16, by the way) may
be considered redundant. But it is exactly this redundancy that makes it
much more secure. In fact, if one code unit gets corrupted (either a high
surrogate, a low surrogate, or a standalone code unit), it is guaranteed that
exactly *one* character will be corrupted.
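
A rough sketch of why this holds, using a simplified decoder of my own (not
any library's API): the high (0xD800-0xDBFF) and low (0xDC00-0xDFFF) surrogate
ranges never overlap, so the decoder always knows where each character starts,
and one damaged code unit can spoil at most the one character it belonged to:

def decode_utf16(units):
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) \
                and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # well-formed surrogate pair -> one supplementary character
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(0xFFFD)           # lone surrogate: replace just it
            i += 1
        else:
            out.append(u)                # ordinary BMP code unit
            i += 1
    return out

good = [0xD801, 0xDC00, 0x0041, 0xD801, 0xDC01]   # U+10400, 'A', U+10401
bad  = [0x0042] + good[1:]                        # corrupt the first unit only
print([hex(c) for c in decode_utf16(bad)])        # only the first character is lost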

See UTR #17 (http://www.unicode.org/unicode/reports/tr17) for more details.

Hoping this helps.
Marco

P.S. I was surprised by your mail because, by coincidence, I have been
reasoning along similar lines for a few days (though without Mediaeval
feuds), weighing the pros and cons of the two methods. The tentative
conclusion that I came to is that a hypothetical alternative approach would
have to offer a vvvvvvvery big saving in memory to make up for giving up the
security features offered by the existing UTFs.


