Hypersurrogates: a proposed convention for ISO 10646 -> Unicode mapping

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Nov 17 1999 - 11:48:21 EST


Since WG2 has agreed not to assign ISO 10646 characters outside the Unicode
range (U+0000 to U-0010FFFF), the only ISO 10646 characters outside that
range are the deprecated private-use areas at 00E00000-00FFFFFF (2,097,152
codepoints) and at 60000000-7FFFFFFF (536,870,912 codepoints).

In anticipation of somebody (a) actually using those codes and (b) wishing
to communicate with the Unicode-speaking world, I hereby propose a mapping
convention called "hypersurrogates".
This mapping convention maps 10646 private-use characters to Unicode
private-use characters, and so need not have official cognizance.

The idea of hypersurrogates is akin to that of surrogates. Whereas two
16-bit surrogates make a Unicode-range character, two Unicode-range
characters make a full 32-bit character. The mapping convention is that
any ISO private-use character with hex digits 'abcdefgh' in the above ranges
is mapped to the hypersurrogate pair U-000Fabcd followed by U-0010efgh.
Furthermore, ISO private-use characters in the range 000F0000-0010FFFF
(Unicode planes 15 and 16) are mapped to hypersurrogate pairs as well, so
their first character is either U-000F000F or U-000F0010. So the first
character of a pair is always in plane 000F, with its low 16 bits in
000F-0010, 00E0-00FF, or 6000-7FFF. Other plane 000F characters never
appear when hypersurrogates are in use.
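
A minimal Python sketch of the convention might look like the following.
The names PUA_RANGES, to_hypersurrogates and from_hypersurrogates are
illustrative only and not part of any standard or library, and the plane
15/16 range is taken as 000F0000-0010FFFF, which is what the first
characters U-000F000F and U-000F0010 imply.

```python
# Illustrative sketch; all names here are hypothetical.
PUA_RANGES = (
    (0x000F0000, 0x0010FFFF),  # Unicode private-use planes 15 and 16
    (0x00E00000, 0x00FFFFFF),  # deprecated ISO 10646 private-use planes
    (0x60000000, 0x7FFFFFFF),  # deprecated ISO 10646 private-use groups
)


def _in_mapped_range(cp):
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def to_hypersurrogates(cp):
    """Map the code 'abcdefgh' to the pair (U-000Fabcd, U-0010efgh)."""
    if not _in_mapped_range(cp):
        raise ValueError(f"{cp:08X} is not in a hypersurrogate-mapped range")
    return 0x000F0000 | (cp >> 16), 0x00100000 | (cp & 0xFFFF)


def from_hypersurrogates(high, low):
    """Recombine a hypersurrogate pair into the original 32-bit code."""
    if high >> 16 != 0x000F or low >> 16 != 0x0010:
        raise ValueError("not a hypersurrogate pair")
    cp = ((high & 0xFFFF) << 16) | (low & 0xFFFF)
    # Plane 000F characters outside the valid sets never start a pair.
    if not _in_mapped_range(cp):
        raise ValueError(f"{high:08X} is not a valid first hypersurrogate")
    return cp
```

For example, to_hypersurrogates(0x60001234) gives the pair
(0x000F6000, 0x00101234), and from_hypersurrogates reverses it.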

In UCS-4/UTF-32 encoding, hypersurrogates cause a 100% growth in octet size,
from 4 octets to 8. In UTF-8 encoding, hypersurrogates cause only a 33%
growth, from 6 octets to 8, for the 60000000-7FFFFFFF area; characters in
00E00000-00FFFFFF grow 60%, from 5 octets to 8.
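
The UTF-8 octet counts can be checked with a short Python sketch. It
assumes the original 1-to-6 octet UTF-8 definition that covers the full
31-bit range; utf8_len and utf8_sizes are illustrative names, not library
functions.

```python
def utf8_len(cp):
    """Octets needed under the original 1-to-6 octet UTF-8 definition."""
    limits = ((0x80, 1), (0x800, 2), (0x10000, 3),
              (0x200000, 4), (0x4000000, 5), (0x80000000, 6))
    for limit, octets in limits:
        if cp < limit:
            return octets
    raise ValueError("beyond the 31-bit UCS-4 range")


def utf8_sizes(cp):
    """Return (octets encoded directly, octets as a hypersurrogate pair)."""
    high = 0x000F0000 | (cp >> 16)    # U-000Fabcd
    low = 0x00100000 | (cp & 0xFFFF)  # U-0010efgh
    return utf8_len(cp), utf8_len(high) + utf8_len(low)
```

Both halves of a pair lie in planes 15 and 16, so each takes 4 octets and a
pair always takes 8: utf8_sizes(0x60000000) returns (6, 8) and
utf8_sizes(0x00E00000) returns (5, 8).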

-- 

John Cowan      http://www.reutershealth.com      jcowan@reutershealth.com
Schlingt dreifach einen Kreis vom dies! / Schliess eurer Aug vor heiliger Schau
Den er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
                        -- Coleridge (tr. Politzer)


