Re: Do 16 bit surrogate high bits indicating characters have a persisting mea...

Date: Thu Feb 22 2001 - 11:15:58 EST

In a message dated 2001-02-22 04:28:10 Pacific Standard Time, writes:

> Suppose that one has a document, say a chapter from a novel, that consists
> of a sequence of unicode characters that are each more than 16 bits in
> significance and one wishes to represent them using a sequence of 16 bit
> unicode characters. Suppose, continuing my analogy, that all of the
> characters are located in the same strip of the great field. Suppose that
> there are n characters in the sequence of 21 bit characters. Would the
> sequence of sixteen bit characters contain 2n or n+1 characters or some
> other number? That is, once a 16 bit character that is indicating high
> order bits has been used, is there a presumption that any number of
> following 16 bit characters that are indicating low order bits are all to
> be considered as indicating a character in the most recent "great field strip"
> indicated, or does one need to use a high and low pair for each character
> from the great field, even if that means continual repetition of the same
> high order bits indicating character?

Yes, one needs a high and low pair for each character, so n supplementary
characters take 2n 16-bit code units. As Marco Cimarosti has indicated, each
supplementary character is represented in UTF-16 by a surrogate *pair*. Both
surrogates need to be specified each time. Consequently, a stream of Deseret
text (for example) will contain a lot of U+D801's.
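
To make the count concrete, here is a rough Python sketch of the surrogate
pair arithmetic. The function name to_utf16_units is made up for this
example and is not part of any standard API; the three sample characters are
the first three code points of the Deseret block (U+10400 to U+10402).

```python
def to_utf16_units(text):
    """Return the UTF-16 code units for text as a list of integers.

    Illustrative helper only (hypothetical name); it does not reject
    lone surrogates that may already be present in the input.
    """
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: one 16-bit code unit.
            units.append(cp)
        else:
            # Supplementary character: always a high AND a low surrogate.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units


deseret = "\U00010400\U00010401\U00010402"
units = to_utf16_units(deseret)
print([f"{u:04X}" for u in units])
# prints ['D801', 'DC00', 'D801', 'DC01', 'D801', 'DC02']
```

Three characters give six code units, and D801 is repeated before every low
surrogate.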

Since the code points used for high surrogates are separate from those used
for low surrogates, UTF-16 could have been designed to work the way you
described, but it was not. (Note that I did not say it SHOULD have been done
that way.)
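
As a sketch of what that means for a decoder (Python again; classify and
decode_utf16_units are hypothetical names for this example, not library
functions): the two surrogate ranges are disjoint, so every code unit can be
classified on sight, but a low surrogate that is not immediately preceded by
a high surrogate is treated as an error rather than as "same strip as
before".

```python
def classify(unit):
    """Classify a 16-bit code unit by range (hypothetical helper)."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low"
    return "bmp"


def decode_utf16_units(units):
    """Decode a list of UTF-16 code units, rejecting unpaired surrogates."""
    out = []
    i = 0
    while i < len(units):
        kind = classify(units[i])
        if kind == "bmp":
            out.append(chr(units[i]))
            i += 1
        elif (kind == "high" and i + 1 < len(units)
              and classify(units[i + 1]) == "low"):
            # Combine the pair into one supplementary code point.
            cp = 0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        else:
            # No "persisting" high surrogate: each low needs its own high.
            raise ValueError(f"unpaired surrogate {units[i]:04X} at index {i}")
    return "".join(out)


# Proper pairs decode; the n+1 form (one high, then several lows) does not.
paired = decode_utf16_units([0xD801, 0xDC00, 0xD801, 0xDC01])
try:
    decode_utf16_units([0xD801, 0xDC00, 0xDC01])
    persisted_ok = True
except ValueError:
    persisted_ok = False
```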

The "persisting" mechanism you describe is part of the Standard Compression
Scheme for Unicode (SCSU), which is described in Unicode Technical Standard
#6. See <> for more information.

-Doug Ewell
 Fullerton, California
