When thinking about using surrogate pairs of 16-bit Unicode characters to
express a 21-bit Unicode character, I like to think in terms of an analogy
with a medieval great field divided into strips for cultivation. A road runs
along one edge of the field, perpendicular to the strips, so that someone
may gain access to a particular place on a particular strip by using the
road to get to the near end of the strip and then proceeding along the
strip. The high-order-bits member of the pair of 16-bit Unicode characters
denotes which strip is being considered, and the low-order-bits member of
the pair denotes how far along the strip from the road one is located.

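
As a rough sketch of the arithmetic behind the analogy, assuming the standard UTF-16 surrogate scheme (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), the "strip" and the "distance along the strip" could be computed as follows. The function name `to_surrogate_pair` is hypothetical and chosen only for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) 16-bit surrogates.

    Assumes the standard UTF-16 scheme; the name is illustrative only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits: which "strip"
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits: how far along it
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

Under this scheme each strip holds 1024 characters, since the low half carries 10 bits.
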
Suppose that one has a document, say a chapter from a novel, that consists
of a sequence of Unicode characters, each of which needs more than 16 bits,
and one wishes to represent them using a sequence of 16-bit Unicode
characters. Suppose, continuing my analogy, that all of the characters are
located in the same strip of the great field, and that there are n
characters in the sequence of 21-bit characters. Would the sequence of
16-bit characters contain 2n or n+1 characters, or some other number? That
is, once a 16-bit character indicating high-order bits has been used, is
there a presumption that any number of following 16-bit characters
indicating low-order bits are all to be considered as indicating characters
in the most recently indicated "great field strip", or does one need to use
a high and low pair for each character from the great field, even if that
means continual repetition of the same high-order-bits character?

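
One way to examine the 2n versus n+1 question empirically is to encode a few characters from the same strip and look at the resulting 16-bit values. The sketch below assumes Python's built-in `utf-16-le` codec as the encoder; the helper name `utf16_code_units` and the three sample characters (U+10400 to U+10402, which share a high surrogate) are chosen only for illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit values produced by encoding text as UTF-16."""
    data = text.encode("utf-16-le")        # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# Three characters that all lie in the same "strip".
units = utf16_code_units("\U00010400\U00010401\U00010402")
print([hex(u) for u in units])
print(len(units))
```

With this encoder the high-order value is repeated before every low-order value, so n characters from the same strip occupy 2n 16-bit values; no persistence of the high half is assumed.
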
I can imagine advantages for both types of usage: a persistence-of-meaning
rule would save a lot of space in a 16-bit character file, yet cutting and
pasting part of a document could cause problems if the high-order bits have
not been stated in the section that is being cut and pasted.

22 February 2001