Paul Keinänen asked:
> While reading Unicode 3.0 book, I was a bit surprised that in chapter
> 3, still in version 3.0 the first conformance requirement is
> "C1: A process shall interpret Unicode values as 16 bit quantities"
> While this requirement certainly made sense in version 1, prior to
> surrogates and UTF-8, that statement does not make much sense
> nowadays, unless for some strange reason the UTF-16 encoding is
> favoured over all other supported encodings.
It is not a "strange reason", but simple continuity with Unicode 2.0.
Unicode was defined to *be* UTF-16 in Unicode 2.0. This continues to
be the case, generically, in Unicode 3.0, but UTF-8 has also been
defined to be a conformant encoding form for the Unicode Standard.
The mandate given the editorial committee by the UTC was to ensure that
both UTF-8 and UTF-16 were conformant encoding forms. UTF-8 was written
into the conformance chapter in such a way as to maximize the continuity
with the existing statements in Unicode 2.0 and to minimize the reworking
of that text.
In particular, regarding C1, the key is found in D5:
"... the code values used in the UTF-16 form of the Unicode Standard
are 16-bit units. These 16-bit code values are also known simply
as Unicode values."
Thus the term "Unicode value" is defined to be a 16-bit code value, and
C1 stands unchanged.
> It would be much more up to date to say
> "C1: A process shall interpret Unicode values in the range 0..0x10FFFF",
> without any reference to how these values are stored.
That would be possible if C1 referred to "Unicode scalar values" (cf.
D28), but it does not.
Keep in mind the history of C1. In concept, it goes back to the statement
on p. 10 of The Unicode Standard, Version 1.0: "All Unicode characters
have a uniform width of 16 bits." The introduction of UTF-16 changed
that, of course, but the formal statement of conformance in The Unicode
Standard, Version 2.0, still was intended to emphasize the fact that
the "thing" you manipulated in Unicode was a 16-bit unit (and *not* an
8-bit octet, as for all other character encoding standards). That was
and is still an important distinction that has to be repeated to
people who try to pump Unicode values through jury-rigged 8-bit or
7-bit pipelines.
> It should be up to the application to use one of the supported
> encoding forms UTF-8, UTF-16 or UTF-32 (which unfortunately is not yet
> defined in 3.0) to represent these values. This would also clarify the
> standard, since all references to surrogates would be moved to the
> chapter of the UTF-16 encoding, since surrogate pairs are really only
> the oddity of the UTF-16 encoding.
It is quite likely that UTF-32 will be approved soon. Currently it is
still at Draft status, and thus could not be written into Unicode 3.0
(whose technical content closed nearly a year ago).
Once UTF-32 is on the books, it may indeed be possible for conformance
for the Unicode Standard to be rewritten to be stated in terms of
Unicode scalar values and 3 separate but co-equal encoding forms. But
don't expect that to happen until the *next* major version of the standard,
since it would require a substantial revision to normative text in
the conformance chapter.
> Section 5.2 also discusses the C-language wchar_t data type. In the
> last paragraph a 32-bit implementation of wchar_t is discussed. "In
> particular, any API or runtime library interfaces that accept strings
> of 32-bit characters are not Unicode-conformant."
> Is this an artefact of UTF-32 not having been defined when 3.0 was
> written, since otherwise that implementation guideline does not make
> much sense?
It was and still is technically accurate, until such time as the UTC
formally approves UTR #19 and decides to make it a part of the standard.
We anticipate that the wchar_t guidelines will be rewritten at the point
when UTF-32 is formally adopted as an encoding form for the Unicode
Standard.
> Paul Keinänen
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT