Re: 31 or 32 bit?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jan 13 1998 - 18:48:01 EST


Carl-Martin Bunz asked:

>
> The problem is the following: What is ISO 10646 - a 31-bit encoding
> containing 2.147.483.648 code positions (as presented in the Unicode
> 2.0 Book p. C-2) or a 32-bit encoding with 4.294.967.296 code points
> (as explained elsewhere, even if its architecture is described as
> '128 groups of 256 planes', which means 2.147.483.648 cells)?
>
> How is the four-octet encoding ISO/IEC 10646 described correctly, i.e.
> in precise terms? In case it is a 31-bit encoding, what is the one
> bit used or reserved for?
>
> Thank you very much for clarifying this simple question.
>

ISO/IEC 10646 declares in its scope clause that it:

"- specifies the four-octet (32-bit) canonical form of the UCS: UCS-4;
 - specifies a two-octet (16-bit) BMP form of the UCS: UCS-2;..."

Clearly, four octets do constitute 32 bits.

But...

Clause 5 "General structure of the UCS" goes on to state:

"The value of any octet is expressed in hexadecimal notation from
00 to FF in ISO/IEC 10646 ... The canonical form of this coded
character set (the way in which it is to be conceived) uses a
four-dimensional coding space, regarded as a single entity,
consisting of 128 three-dimensional groups.
  NOTE - Thus, bit 8 of the most significant octet in the
  canonical form of a coded character can be used for internal
  processing purposes within a device as long as it is set to zero
  within a conforming CC-data-element."

Thus, if the 32 bits of the four octets are numbered 1..32, bit 32
is available for internal processing within a device, but it is not
transmissible and is not part of the encoding space.
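
The rule about bit 32 can be sketched in Python. This is only an
illustration, not anything taken from the standard: the function names
is_conforming_ucs4 and split_ucs4 are hypothetical, and the four octets
are assumed to arrive most significant octet first. The group, plane,
row, cell naming follows the standard's terms for the four octets.

```python
def is_conforming_ucs4(octets: bytes) -> bool:
    """True if the most significant bit of the four octets is zero."""
    if len(octets) != 4:
        raise ValueError("UCS-4 uses exactly four octets")
    # Bit 8 of the most significant octet (bit 32 overall) must be zero
    # in a conforming CC-data-element.
    return (octets[0] & 0x80) == 0


def split_ucs4(octets: bytes) -> tuple:
    """Return (group, plane, row, cell) for a conforming four-octet value."""
    if not is_conforming_ucs4(octets):
        raise ValueError("bit 32 is set; not a conforming value")
    return octets[0], octets[1], octets[2], octets[3]
```

With this sketch, the octets 7F FF FF FF pass the check and split into
group 127, plane 255, row 255, cell 255, while 80 00 00 00 is rejected.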

Effectively, that means that ISO/IEC 10646 is a 31-bit encoding
space (0..2,147,483,647) with a 32-bit encoding form (UCS-4).
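
The arithmetic behind those figures can be checked in a few lines of
Python. The constant names are mine, and the 256 rows of 256 cells per
plane are assumed from the standard's structure rather than quoted
above:

```python
# 128 groups, each of 256 planes, each of 256 rows of 256 cells.
GROUPS, PLANES, ROWS, CELLS = 128, 256, 256, 256

code_positions = GROUPS * PLANES * ROWS * CELLS
assert code_positions == 2**31 == 2_147_483_648

# The highest code position is therefore 2**31 - 1.
max_code_position = code_positions - 1
assert max_code_position == 0x7FFFFFFF == 2_147_483_647

# A full 32 bits would give twice as many values.
assert 2**32 == 4_294_967_296 == 2 * code_positions
```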

--Ken


