While reading the Unicode 3.0 book, I was a bit surprised that in chapter
3, still in version 3.0, the first conformance requirement is
"C1: A process shall interpret Unicode values as 16 bit quantities"
This requirement certainly made sense in version 1, prior to
surrogates and UTF-8, but it does not make much sense
nowadays, unless for some strange reason the UTF-16 encoding is favoured
over all other supported encodings.
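To see that none of the encoding forms is privileged, here is a quick sketch (in modern Python, purely as illustration; it is obviously not part of the standard) of one and the same code point represented in each of the three encoding forms:

```python
# U+1D11E MUSICAL SYMBOL G CLEF, a code point above 0xFFFF
ch = "\U0001D11E"

# The same abstract value, three different storage formats:
for form in ("utf-8", "utf-16-be", "utf-32-be"):
    print(form, ch.encode(form).hex())
```

All three byte sequences denote the identical Unicode value 0x1D11E; only the UTF-16 form needs the surrogate mechanism to express it.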
It would be much more up to date to say
"C1: A process shall interpret Unicode values in the range 0 .. 0x10FFFF
without any reference to how these values are stored."
It should be up to the application to use one of the supported
encoding forms UTF-8, UTF-16 or UTF-32 (which unfortunately is not yet
defined in 3.0) to represent these values. This would also clarify the
standard, since all references to surrogates would be moved to the
chapter on the UTF-16 encoding; surrogate pairs are really only an
oddity of the UTF-16 encoding.
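To make that oddity concrete, the surrogate-pair construction is pure UTF-16 arithmetic, nothing about the code point itself. A minimal Python sketch (the function name is mine, for illustration only):

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point (U+10000..U+10FFFF)
    into a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)     # high (leading) surrogate
    low = 0xDC00 + (cp & 0x3FF)    # low (trailing) surrogate
    return high, low

# U+1D11E is stored as the pair D834 DD1E in UTF-16
print([hex(u) for u in to_surrogate_pair(0x1D11E)])
```

In UTF-8 or UTF-32 the same code point is encoded directly, with no pair splitting, which is the point: surrogates belong in the UTF-16 chapter.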
Section 5.2 also discusses the C-language wchar_t data type. The
last paragraph discusses a 32-bit implementation of wchar_t: "In
particular, any API or runtime library interfaces that accept strings
of 32-bit characters are not Unicode-conformant."
Is this an artefact of UTF-32 not yet being defined when 3.0 was
written? Otherwise that implementation guideline does not make much
sense.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT