RE: Perception that Unicode is 16-bit (was: Re: Surrogate space in

From: Cathy Wissink (cwissink@microsoft.com)
Date: Tue Feb 20 2001 - 11:49:32 EST


The people responsible for this text have been made aware of the
problem. This will be updated for Windows XP.

Cathy

-----Original Message-----
From: DougEwell2@cs.com [mailto:DougEwell2@cs.com]
Sent: Tuesday, February 20, 2001 8:04 AM
To: Unicode List
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in

In a message dated 2001-02-20 04:21:49 Pacific Standard Time,
KNAPPEN@ALPHA.NTP.SPRINGER.DE writes:

> A little out of date, but correctly describing the state of the art in 1991
> before the merger.

Agreed, but the example was from Windows 2000. It should at least be current
through Unicode 2.1.

> Even 8-bit ASCII is a correct term meaning ISO-8859-1.

I would question that. Understandable, yes, but not really correct.
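The distinction is easy to demonstrate with a quick Python sketch (mine, not anything from the Windows help text): ASCII proper is a 7-bit code, so bytes above 0x7F have no ASCII meaning at all, while ISO-8859-1 assigns them to accented letters and other characters.

```python
# ASCII is 7-bit: bytes above 0x7F are not ASCII at all.
# ISO-8859-1 (Latin-1) assigns those bytes to additional characters.
b = bytes([0xE9])                  # the single byte 0xE9
print(b.decode('latin-1'))         # 'é' (LATIN SMALL LETTER E WITH ACUTE)
try:
    b.decode('ascii')
except UnicodeDecodeError as e:
    print('not ASCII:', e.reason)  # ordinal not in range(128)
```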

> A nit to pick: It's the latin alphabet, not roman. Roman is a kind of
> typeface, contrasting with sans serif, a.k.a. grotesque.

True. I have also heard "roman" used to mean the opposite of italic.

> > Exercise for the reader: See how many misstatements about Unicode (and
> > ASCII) you can find in this text.
>
> Fewer than you expect. Only the target described no longer exists.
> Since the merger with ISO 10646 was foreseeable even at that time, there
> are no implementations of Unicode 1.0 anyway.

Here is my list. Remember that I am expecting information supplied with
Windows 2000 to be current through Unicode 2.1.

> A 16-bit character encoding standard

Wrong; surrogates have existed since about 1993 (someone help me with the
exact date).

> developed by the Unicode Consortium between 1988 and 1991.

This implies that development was finished in 1991, and only new characters
are added. In fact, lots of new development to Unicode has taken place since
then (just look at all the TRs). This might be splitting hairs.

> By using two bytes to represent each character,

Even "16 bits" would be better than "two bytes" here, but again this is
nit-picking.

> Unicode enables almost all of the written languages of the world to be
> represented using a single character set.

Hey, they got something right!

> By contrast, 8-bit ASCII

Mentioned above.

> is not capable of representing all of the combinations of letters and
> diacritical marks that are used just with the Roman alphabet.

I thought "Roman" was simply an alternate word for "Latin," but Jorg is
correct. This is also an error.

> Approximately 39,000 of the 65,536 possible Unicode character codes have
> been assigned to date, 21,000 of them being used for Chinese ideographs.

The count was correct once, but that was 10 years ago.

> The remaining combinations are open for expansion.

"Combinations"? You mean of two bytes?

Well, that's about enough. I am not a habitual Microsoft basher, but
somebody in their Help department really needs to update the information
distributed with their OS. Tex is right that we are bound to see a certain
amount of misinformation, but it is our duty to help correct it.

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT