Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

Date: Mon Feb 19 2001 - 21:04:23 EST

A few days ago I said there was a "widespread belief" that Unicode is a
16-bit-only character set that ends at U+FFFF. A corollary is that the
supplementary characters ranging from U+10000 to U+10FFFF are either
little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.
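
For concreteness, here is a minimal sketch of why the range above does not
fit in a single 16-bit unit. It is written in Python, and the helper name
utf16_code_units is my own invention for illustration, not part of any
library. It assumes the standard UTF-16 surrogate arithmetic: code points
below U+10000 take one 16-bit code unit, and supplementary code points take
two.

```python
# Hypothetical helper (name chosen for illustration only).
MAX_CODE_POINT = 0x10FFFF                # last code point in the Unicode code space
CODE_SPACE_SIZE = MAX_CODE_POINT + 1     # 17 planes of 65,536 code points each


def utf16_code_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded on their own")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit is enough.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)       # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # low (trailing) surrogate
    return [high, low]


if __name__ == "__main__":
    for cp in (0xFFFF, 0x10000, 0x10FFFF):
        units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
        print(f"U+{cp:04X} -> {units}")
```

Run as a script, this should show U+FFFF as the single unit FFFF, U+10000 as
the pair D800 DC00, and U+10FFFF as the pair DBFF DFFF.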

At least one list member questioned whether this belief was really widespread.

Here is an example from the help file for Character Map in Microsoft Windows
2000. Open the "Character Map overview" topic and click on the word "Unicode"
to see the following definition:

"A 16-bit character encoding standard developed by the Unicode Consortium
between 1988 and 1991. By using two bytes to represent each character,
Unicode enables almost all of the written languages of the world to be
represented using a single character set. By contrast, 8-bit ASCII is not
capable of representing all of the combinations of letters and diacritical
marks that are used just with the Roman alphabet.

"Approximately 39,000 of the 65,536 possible Unicode character codes have
been assigned to date, 21,000 of them being used for Chinese ideographs. The
remaining combinations are open for expansion.

"See also ASCII."

Exercise for the reader: See how many misstatements about Unicode (and
ASCII) you can find in this text.

-Doug Ewell
 Fullerton, California

