RE: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Carl W. Brown
Date: Thu Feb 22 2001 - 14:45:02 EST


Your comment about Microsoft having pie in its face is a bit puzzling. They
based NT on Unicode 1.0, and Windows 2000, which was sent to manufacturing 15
months ago, has surrogate support. For all its faults, MS has been a big
promoter of Unicode.

What burns me up is Sun implementing a non-Unicode wchar_t or, worse yet,
Oracle proposing to encode surrogates into UTF-8 as two UCS-2 characters.
This bastardized UTF-8 will probably decode properly, but it will not deal
with properly encoded UTF-8. Also, non-plane 0 characters will take 6 bytes
instead of 4.
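
To make the 6-versus-4 byte count concrete, here is a minimal Python sketch.
It assumes the proposal amounts to encoding each UTF-16 surrogate as its own
three-byte sequence. The helper names (surrogate_pair, encode_3byte,
surrogates_as_utf8) are hypothetical and only for illustration; they are not
taken from any vendor's implementation.

```python
def surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def encode_3byte(unit):
    """Encode one 16-bit value in 0x0800..0xFFFF as a three-byte sequence."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def surrogates_as_utf8(cp):
    """Assumed form of the proposal: each surrogate encoded separately."""
    high, low = surrogate_pair(cp)
    return encode_3byte(high) + encode_3byte(low)


cp = 0x10400                       # an arbitrary code point outside plane 0
proper = chr(cp).encode("utf-8")   # standard UTF-8
six = surrogates_as_utf8(cp)       # surrogate-based form

print(len(proper), proper.hex())   # 4 f0909080
print(len(six), six.hex())         # 6 eda081edb080

# A strict UTF-8 decoder (Python's, in this case) rejects the six-byte form.
try:
    six.decode("utf-8")
except UnicodeDecodeError:
    print("strict UTF-8 decoder rejects the surrogate-based form")
```

The two forms are not interchangeable: a strict decoder refuses the six-byte
sequence, and a decoder written only for the six-byte form would presumably
need extra handling for the standard four-byte one.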


-----Original Message-----
From: Joel Rees
Sent: Wednesday, February 21, 2001 8:55 PM
To: Unicode List
Cc: Unicode List
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space


Would you mind if I re-post my reply that I forgot to cc to the list?

--------------------------- missing post ----------------------------

What exactly _would_ be wrong with calling UNICODE a thirty-two bit encoding
with a couple of ways to represent most of the characters in a smaller
number of bits? From a practical perspective, that would seem the most
correct and least misleading to me. (For example, no one writing a character
handling library is going to try to declare a 24 bit type for characters,
and no one writing a real character handling library is going to try to
build flat monolithic classification tables for the whole 90,000+ in the
current set anyway.)
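
As a rough illustration of "thirty-two bits, with smaller representations
available", the following Python sketch compares how many bytes the same
characters take in UTF-32, UTF-16 and UTF-8. The sample characters and the
helper name encoded_sizes are arbitrary choices made for this example.

```python
# Sample characters: ASCII, Latin-1 range, a CJK ideograph, and one
# character outside plane 0.
samples = ["A", "\u00e9", "\u4e2d", "\U00010400"]


def encoded_sizes(ch):
    """Return the encoded length in bytes of one character in each form."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always takes four bytes, while the other two forms take fewer bytes
for most characters and four bytes for the one outside plane 0.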

I do realize that some managers at (particularly) Sun and Microsoft are
probably still feeling a little like they've got pie in the face because
their wonderful 16 bit character types turned out not to be as simple a
solution as they claimed they would be.

Btw, saying approximately 20.087 bits (Am I calculating that right --
log2[ 17*65536 ]?) causes many people to think they are just being teased.
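
The arithmetic can be checked with a few lines of Python. The constant
names below are only labels chosen for this sketch.

```python
import math

PLANES = 17
CODE_POINTS_PER_PLANE = 65536

total = PLANES * CODE_POINTS_PER_PLANE
bits = math.log2(total)

print(total)            # 1114112
print(round(bits, 3))   # 20.087
print(math.ceil(bits))  # 21, the smallest whole number of bits that fits
```

So the figure holds: 16 bits for a plane plus log2(17), which is a little
over 4 bits, for the plane number.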

Now I happen to be of the opinion that the attempt to proclaim the set
closed at 17 planes is a little premature. It's the newbie in me, I'm sure,
but I still remember that disconcerted feeling I got when my freshman
Algebra for CS teacher pointed out that real character sets are in principle
not subject to closure -- something like the churning in the stomach I got
when thinking of writing a program that would fill more than 64K of memory.

Joel Rees, Media Fusion KK
Amagasaki, Japan

----- Original Message -----
From: "Marco Cimarosti"
To: "Unicode List"
Sent: Wednesday, February 21, 2001 5:53 PM
Subject: RE: Perception that Unicode is 16-bit (was: Re: Surrogate space i

> Peter Constable:
> > On 02/20/2001 03:34:28 AM Marco Cimarosti wrote:
> > > "Unicode is now a 32-bit character encoding standard,
> > > although only about one million codes actually exist,
> > > [...]
> >
> > Well, it's probably a better answer to say that Unicode is a 20.1-bit
> > encoding since the direct encoding of characters is the coded [...]
> Your explanation is very correct. This is precisely how I used to start my
> endless explanations to those colleagues :-) And they invariably
> interrupted the explanation, asking: "So, how many bits does it have?"
> That's why I wanted to simplify even more by saying something like: "It is 32
> bits (yes 32, like 4 bytes, OK?) but, as not all combinations are used,
> there are techniques to shrink it down a lot."
> _ Marco

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT