The Unicode Consortium and ISO treat their standards differently. They are
separate standards that are kept in sync, not one single standard.
ISO 10646 defines 32-bit code points and allows less capable encodings for
subsets. ISO is moving in the direction of using only planes 0 to 16 for
official assignments, since those can be encoded with UTF-16.
The Unicode Consortium defines 16-bit characters and, with version 2.0 of the
Unicode Standard, includes the use of surrogate pairs to reach the same range
of characters as UTF-16. It is not going to use 32-bit characters; the
surrogate pair mechanism is exactly UTF-16.
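The surrogate pair mechanism described above maps each code point beyond U+FFFF to two 16-bit units. A minimal sketch (function names are mine, not from the thread):

```python
def to_surrogates(cp):
    """Encode a supplementary-plane code point (U+10000..U+10FFFF)
    as a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    cp -= 0x10000                               # 20 bits remain
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogates(hi, lo):
    """Reassemble a surrogate pair into a code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

For example, U+10000 encodes as D800 DC00, the first pair in the surrogate range.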
This may be inconvenient when handling non-BMP characters, and it is
questionable for internal algorithms, where UCS-4 makes things easier,
especially since you have to be prepared for several different file and stream
formats anyway.
However, it probably makes sense for files as an easy and somewhat compact
format, and it makes sense for the number of possible characters: 1M + 64k,
including 128k + 6400 private-use code points. About 38,000 characters are
assigned so far, with about 20,000-30,000 more in the pipeline.
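The capacity figures above can be checked with a little arithmetic (a sketch, using the private-use ranges defined in the standard):

```python
# Total capacity: plane 0 (the BMP) plus planes 1-16 reachable
# through surrogate pairs.
bmp = 1 << 16                    # 65,536 code points ("64k")
supplementary = 16 * (1 << 16)   # 1,048,576 code points ("1M")
total = bmp + supplementary      # 1,114,112 = 0x110000

# Private use: U+E000..U+F8FF in the BMP, plus planes 15 and 16.
pua_bmp = 0xF8FF - 0xE000 + 1    # 6,400
pua_planes = 2 * (1 << 16)       # 131,072 ("128k")
```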
Implementations using UCS-4 are not conformant with Unicode, strictly speaking,
but I would not worry too much about it as long as you can read and write
16-bit Unicode text:
"On systems where the native character type or wchar_t are implemented as 32-bit
quantities, an implementation may transiently use 32-bit quantities to
represent Unicode characters during processing. The internals of this
representation are treated as a black box and are not Unicode conformant. In
particular, any API or runtime library interfaces that accept strings of 32-bit
characters are not Unicode conformant. If such an implementation interchanges
16-bit Unicode characters with the outside world, then this interchange can be
conformant as long as the interface for this interchange complies with the
requirements of Chapter 3, Conformance."
from "The Unicode Standard, Version 2.0", July 1996, Addison-Wesley, Chapter 5,
"Implementation Guidelines", Section 5.1, "ANSI/ISO C wchar_t"
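The quoted guideline permits a 32-bit internal representation so long as interchange happens in 16-bit units. A sketch of that flattening step at the interchange boundary (names are my own):

```python
def ucs4_to_utf16_units(code_points):
    """Flatten a sequence of 32-bit code points (an internal
    'UCS-4' representation) into 16-bit units for interchange."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)                     # BMP: one unit
        else:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
    return units
```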
Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
Edwin.Hart@jhuapl.edu on 98-11-09 09:34:13
Please respond to firstname.lastname@example.org
Subject: Displaying Plane 1 characters (annotating the code tables)
Can we resolve the issue of whether to display the Unicode 3.0 code tables for
planes 01 to 10 by organizing the tables according to UCS-4 (including the
annotation for the rows and columns of the code tables) and perhaps
including the UTF-16 coding under the glyph for each character? If
sufficient room does not exist for the full UTF-16 notation, then perhaps
placing the 4 hex digits of the surrogate for the high-order 16 bits (Wyde
:-)) at the top of the table and the 4 hex digits of the surrogate for the
low-order 16 bits under each character would work.
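One detail supports this layout: a high surrogate covers 0x400 consecutive code points, so all 256 cells of a code-table page share a single high surrogate, which could indeed be printed once at the top. A hypothetical helper illustrating the idea (all names are mine):

```python
def page_annotation(page_start):
    """Return the shared high surrogate for the 256-code-point
    table page starting at page_start (in a supplementary plane),
    plus the low surrogate for each of its 256 cells."""
    offset = page_start - 0x10000
    hi = 0xD800 + (offset >> 10)
    lows = [0xDC00 + ((offset + i) & 0x3FF) for i in range(256)]
    return hi, lows
```

For the first page of plane 1 (U+10000..U+100FF), the shared high surrogate is D800 and the cells run DC00 through DCFF.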
I have had the very distinct impression that Unicode had a preference for
"16-bit" encoding to the exclusion of the 4-octet UCS-4 form of ISO/IEC
10646-1:1993. I verified that UCS-4 appears to be excluded from the
conformance clause of Unicode 2.0. However, what if I chose to implement
the UCS-4 form of 10646, implemented everything else in conformance with
the Unicode conformance statements (character properties, bi-di, etc.),
and my implementation was able to exchange data in both the UCS-4 and
UCS-2/UTF-16 forms? Is this implementation conformant to Unicode 2.0?
My reading is that this implementation is not conformant to Unicode 2.0. My
question is: why should the UCS-4 form not be included as a conformant form
in Unicode? As we move to encoding characters in planes 1 and beyond, I
think it makes good sense to add the UCS-4 form as conformant to Unicode.
Edwin F. Hart
Applied Physics Laboratory
11100 Johns Hopkins Road
Laurel, MD 20723-6099
+1-240-228-6926 (from Washington, DC area)
+1-443-778-6926 (from Baltimore area)
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:42 EDT