Re: Unicode FAQ addendum

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Jul 20 2000 - 10:58:22 EDT


>| C1 says "A process shall interpret Unicode code values as 16-bit
>| quantities."

I think the focus here was supposed to be on the fact that Unicode code
values are *not 8-bit* quantities. I found out about Unicode in late
1991, when I discovered a copy of TUS 1.0 in a bookstore, and for years
afterward, whenever I read an article about Unicode there was sure to be
some angle claiming that Unicode "broke" the C-language string model
by including "nulls," or zero bytes, in the character stream. Users of
single-byte and even multi-byte character sets had to overcome a major
mental block by realizing that the 16-bit word, not the byte, was the
atomic code unit. That was the biggest revolution in Unicode, and that
is probably why it was made Conformance Requirement Number One.
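
(To make the "nulls" point concrete, here is a little C sketch of my
own, not anything from TUS: treat a 16-bit Unicode string as raw bytes
and the byte-oriented string functions give up at the first zero byte.
It assumes a little-endian machine for the strlen result.)

/* Sketch only: "Az" stored as 16-bit code values.  Viewed byte by
   byte, the high (zero) byte of 'A' looks like a C string terminator. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned short units[] = { 0x0041, 0x007A, 0x0000 };  /* 'A', 'z', NUL */
    const char *bytes = (const char *)units;
    size_t n = 0;

    /* Byte-oriented length: stops at the first zero byte
       (prints 1 on a little-endian machine, 0 on big-endian). */
    printf("strlen on bytes:    %lu\n", (unsigned long)strlen(bytes));

    /* Word-oriented length: count 16-bit code values instead. */
    while (units[n] != 0)
        n++;
    printf("16-bit code values: %lu\n", (unsigned long)n);   /* prints 2 */
    return 0;
}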

That said, I agree that the distinction between "code value" and
"character value," where the first still fits the 16-bit model but the
second does not, may be technically correct but feels a little like
wordplay. I wonder if the original intent (WORDS, not BYTES!) could
be preserved and the scalar value vs. code unit distinction addressed
at the same time. How about this:

"A process shall interpret Unicode character values as sequences of one
or two 16-bit quantities."
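
(For the record, the "one or two" in that wording is just the surrogate
mechanism. Here is a quick sketch of the arithmetic, mine and not from
the conformance text, using 0x10300 as an arbitrary example of a
character value above U+FFFF:)

/* Sketch only: one character value above U+FFFF comes out as two
   16-bit code values (a surrogate pair).                         */
#include <stdio.h>

int main(void)
{
    unsigned long scalar = 0x10300UL;   /* example value beyond U+FFFF */
    unsigned short units[2];
    int count, i;

    if (scalar < 0x10000UL) {
        units[0] = (unsigned short)scalar;              /* one quantity   */
        count = 1;
    } else {
        unsigned long u = scalar - 0x10000UL;           /* two quantities */
        units[0] = (unsigned short)(0xD800 + (u >> 10));
        units[1] = (unsigned short)(0xDC00 + (u & 0x3FFUL));
        count = 2;
    }

    printf("U+%05lX -> %d code value(s):", scalar, count);
    for (i = 0; i < count; i++)
        printf(" %04X", (unsigned)units[i]);
    printf("\n");    /* U+10300 -> 2 code value(s): D800 DF00 */
    return 0;
}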

-Doug Ewell
 Fullerton, California
