Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets) from Ken Whistler on 2011-08-19 (Unicode Mail List Archive)

From: Ken Whistler <kenw_at_sybase.com>
Date: Fri, 19 Aug 2011 15:24:59 -0700

On 8/19/2011 2:07 PM, Doug Ewell wrote:
> Technically, I think 10646 was always limited to 32,768 planes so that
> one could always address a code point with a 32-bit signed integer (a
> nod to the Java fans).

Well, yes, but it didn't really have anything to do with Java. Remember
that Java wasn't released until 1995, but the 10646 architecture dates
back to circa 1986. So more likely it was a nod to C implementations
which would, it was supposed, have implemented the 2-, 3-, or 4-octet
forms of 10646 with a wchar_t, and which would have wanted a signed
32-bit type to work. I suspect, by the way, that that limitation was
probably originally brought to WG2 by the U.S. national body, as they
would have been the ones most worried about the C implementations of
10646 multi-octet forms.

And the original architecture was also not really a full 32K planes in
the sense that we now understand planes for Unicode and 10646. The
original design for 10646 was for a 1- to 4-octet encoding, with all
octets conforming to the ISO 2022 specification. It used the option that
the "working sets" for the encoding octets would be the 94-unit ranges.
So for G0: 0x21..0x7E and for G1: 0xA1..0xFE. The other bytes (C0, 0x20,
0x7F, C1, 0xA0, 0xFF) were not used except for the single-octet form, as
in 2022-conformant schemes still used today for some East Asian
character encodings.
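
To make those octet ranges concrete, here is a minimal Python sketch
that classifies byte values using the ranges given above. The names
(G0, G1, is_working_set_octet) are illustrative only and are not taken
from ISO 2022 or 10646.

```python
# Illustrative sketch: the two 94-unit "working sets" described above.
# All names here are hypothetical, chosen only for this example.
G0 = range(0x21, 0x7F)  # 0x21..0x7E inclusive
G1 = range(0xA1, 0xFF)  # 0xA1..0xFE inclusive


def is_working_set_octet(octet: int) -> bool:
    """True if the octet falls in one of the two 94-unit ranges."""
    return octet in G0 or octet in G1


# Byte values left out of the multi-octet forms:
# C0 (0x00..0x1F), 0x20, 0x7F, C1 (0x80..0x9F), 0xA0, 0xFF.
excluded = [b for b in range(0x100) if not is_working_set_octet(b)]

print(len(G0), len(G1), len(excluded))
```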

And the octets were then designated G (group), P (plane), R (row), and
C (cell).

The 1-octet form thus allowed 95 + 96 = 191 code positions.

The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions.

The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions.

The Group octet was constrained to the low set of 94. (This is the origin
of the constraint to half the planes, which would keep wchar_t
implementations out of negative signed range.)
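
The sign-bit consequence can be sketched in a few lines of Python. This
assumes the four octets are packed with G as the most significant
octet, and the helper name as_signed_32 is made up for the example. A
Group octet from the low set (0x21..0x7E) leaves the top bit clear,
while one from the high set (0xA1..0xFE) would set it.

```python
import struct


def as_signed_32(g: int, p: int, r: int, c: int) -> int:
    """Reinterpret octets G, P, R, C (G most significant; an assumption
    of this sketch) as a signed 32-bit integer."""
    return struct.unpack(">i", bytes([g, p, r, c]))[0]


# Largest value with the Group octet in the low set: still non-negative.
low_group = as_signed_32(0x7E, 0xFE, 0xFE, 0xFE)

# A Group octet from the high set would set the sign bit.
high_group = as_signed_32(0xA1, 0x21, 0x21, 0x21)

print(low_group >= 0, high_group < 0)
```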

The 4-octet form thus allowed 94 * (94 + 94)^3 = 624,599,168 code positions.

The grand total for all possible forms was the sum of those values or:

*631,279,375* code positions

(before various *other* set-asides for "plane swapping" and private
use start getting taken into account)
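
For anyone who wants to check the arithmetic, the counts above can be
reproduced with a short Python sketch. The variable names are
illustrative only; the numbers come straight from the figures in this
message.

```python
# Reproduce the code-position counts for each form of the original design.
UNITS = 94 + 94  # usable values per octet in the multi-octet forms

one_octet = 95 + 96          # single-octet form
two_octet = UNITS ** 2       # 2-octet form
three_octet = UNITS ** 3     # 3-octet form
four_octet = 94 * UNITS ** 3 # 4-octet form, Group octet limited to the low 94

total = one_octet + two_octet + three_octet + four_octet

print(f"{one_octet:,} {two_octet:,} {three_octet:,} {four_octet:,}")
print(f"{total:,}")
```

The total also stays well under 2**31 (2,147,483,648), which is the 2.1
billion figure mentioned in the quoted text below.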

>
> Of course, 2.1 billion characters is also overkill, but the advent of
> UTF-16 was how we ended up with 17 planes.

So a lot less than 2.1 billion characters. But I think Doug's point is
still valid: 631 million plus code points was still overkill for the
problem to be addressed.

And I think that we can thank our lucky stars that it isn't *that*
architecture for a universal character encoding that we would now be
implementing and debating on the alternative universe version of this
email list. ;-)

--Ken
Received on Fri Aug 19 2011 - 17:26:57 CDT
