In a message dated 2001-02-22 04:28:10 Pacific Standard Time,
> Suppose that one has a document, say a chapter from a novel, that consists
> of a sequence of unicode characters that are each more than 16 bits in
> significance and one wishes to represent them using a sequence of 16 bit
> unicode characters. Suppose, continuing my analogy, that all of the
> characters are located in the same strip of the great field. Suppose that
> there are n characters in the sequence of 21 bit characters. Would the
> sequence of sixteen bit characters contain 2n or n+1 characters or some
> other number? That is, once a 16 bit character that is indicating high
> order bits has been used, is there a presumption that any number of
> following 16 bit characters that are indicating low order bits are all to
> be considered as indicating a character in the most recent "great field strip"
> indicated, or does one need to use a high and low pair for each character
> from the great field, even if that means continual repetition of the same
> high order bits indicating character?
Yes, one needs a high and low pair for each character, so n supplementary
characters take 2n 16-bit code units. As Marco Cimarosti has indicated, each
supplementary character is represented in UTF-16 by a surrogate *pair*. Both
surrogates need to be specified each time. Consequently, a stream of Deseret
text (for example) will contain a lot of U+D801's.
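
To make the count concrete, here is a minimal sketch in Python of the
surrogate-pair arithmetic. The function name to_utf16_units and the sample
string are my own, chosen only for illustration; the sample is three Deseret
letters, U+10400 through U+10402.

```python
def to_utf16_units(text):
    """Return the 16-bit UTF-16 code units for text as a list of ints."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: a single code unit.
            units.append(cp)
        else:
            # Supplementary character: always a full surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units


# Hypothetical sample: three Deseret letters, U+10400..U+10402.
sample = "\U00010400\U00010401\U00010402"
units = to_utf16_units(sample)
print([f"{u:04X}" for u in units])
# ['D801', 'DC00', 'D801', 'DC01', 'D801', 'DC02']
```

Three characters come out as six code units, and the high surrogate D801 is
repeated before every low surrogate, since all three letters share the same
high-order bits.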
Since the code points used for high surrogates are separate from those used
for low surrogates, UTF-16 could have been designed to work the way you
described, but it was not. (Note that I did not say it SHOULD have been done
that way.)
The "persisting" mechanism you describe is part of the Standard Compression
Scheme for Unicode (SCSU), which is described in Unicode Technical Standard
#6. See <http://www.unicode.org/unicode/reports/tr6/> for more information.