Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Joel Rees
Date: Thu Feb 22 2001 - 00:06:29 EST


Would you mind if I re-post my reply that I forgot to cc to the list?

--------------------------- missing post ----------------------------

What exactly _would_ be wrong with calling UNICODE a thirty-two bit encoding
with a couple of ways to represent most of the characters in a smaller
number of bits? From a practical perspective, that would seem the most
correct and least misleading to me. (For example, no one writing a character
handling library is going to try to declare a 24 bit type for characters,
and no one writing a real character handling library is going to try to
build flat monolithic classification tables for the whole 90,000+ in the
current set anyway.)
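
To make the "smaller number of bits" point concrete, here is a minimal
sketch in Python that counts how many bytes a single character occupies in
each of the three encoding forms. The function name encoded_sizes is made up
for illustration; only the standard str.encode call is assumed.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" variants avoid counting a byte order mark.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

# One ASCII letter, one BMP character, and the first code point beyond the BMP.
for ch in ("A", "\u3042", "\U00010000"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

Only the UTF-32 column stays at four bytes for every character; the other
two forms grow only when the code point requires it.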

I do realize that some managers at (particularly) Sun and Microsoft are
probably still feeling a little like they've got pie in the face because
their wonderful 16 bit character types turned out not to be as simple a
solution as they claimed they would be.

Btw, saying approximately 20.087 bits (Am I calculating that right --
log2[ 17*65536 ]?) causes many people to think they are just being teased.
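
For anyone who wants to check the arithmetic, a minimal sketch in Python
follows. The constant and function names are illustrative only, and the
count assumes 17 planes of 65,536 code points each, as discussed above.

```python
import math

PLANES = 17                    # plane 0 through plane 16
CODE_POINTS_PER_PLANE = 65536  # 2**16 code points in each plane

def bits_needed(count):
    """Return the fractional number of bits needed to index count values."""
    return math.log2(count)

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                               # 1114112
print(round(bits_needed(total), 3))        # 20.087
print(math.ceil(bits_needed(total)))       # 21 whole bits
```

So the figure is right: a little under 20.09 bits, which rounds up to 21
whole bits if every code point has to fit in one fixed-width field.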

Now I happen to be of the opinion that the attempt to proclaim the set
closed at 17 planes is a little premature. It's the newbie in me, I'm sure,
but I still remember that disconcerted feeling I got when my freshman
Algebra for CS teacher pointed out that real character sets are in principle
not subject to closure -- something like the churning in the stomach I got
when thinking of writing a program that would fill more than 64K of memory.

Joel Rees, Media Fusion KK
Amagasaki, Japan

----- Original Message -----
From: "Marco Cimarosti" <>
To: "Unicode List" <>
Sent: Wednesday, February 21, 2001 5:53 PM
Subject: RE: Perception that Unicode is 16-bit (was: Re: Surrogate space i

> Peter Constable:
> > On 02/20/2001 03:34:28 AM Marco Cimarosti wrote:
> > > "Unicode is now a 32-bit character encoding standard,
> > > although only about one million of codes actually exist,
> > > [...]
> >
> > Well, it's probably a better answer to say that Unicode is a 20.1-bit
> > encoding since the direct encoding of characters is the coded [...]
> Your explanation is very correct. This is precisely how I used to start my
> endless explanations to those colleagues :-) And they invariably interrupted
> the explanation asking: "So, how many bits does it have?"
> That's why I wanted to simplify even more saying something like: "It is 32
> bits (yes 32, like 4 bytes, OK?) but, as not all combinations are used,
> there are techniques to shrink it down a lot."
> _ Marco
