L2/01-058

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Wednesday, January 24, 2001 2:08 PM

Subject: Draft changes for wchar_t and Conformance sections

I noted an omission in PDUTR #27 last week with respect to the existing
wording about wchar_t. Ken invited me to submit proposed changes. They
are attached. Comments welcome.

		-- Sandra
-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com

***********************************************************************
In Unicode R3.0:

Section 5.2 ANSI/ISO C wchar_t (pp 107-108)

   With the wchar_t wide character type, ANSI/ISO C provides for the inclusion
   of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the
   wide character set to the specific implementation but requires that the
   characters from the portable C execution set correspond to their wide
   character equivalents by zero extension. The Unicode characters in the
   ASCII range U+0020 to U+007E satisfy these conditions. Thus, if an
   implementation uses ASCII to code the portable C execution set, the use
   of the Unicode character set for the wchar_t type, with a width of 16 bits,
   fulfills the requirement.

Change the last sentence to "...the use of the Unicode character set for
the wchar_t type, in either UTF-16 or UTF-32 form, fulfills the requirement."
   
   The width of wchar_t is compiler-specific and can be as little as 8 bits.
   Consequently, programs that need to be portable across any C or C++ compiler
   should not use wchar_t for storing Unicode text. The wchar_t type is
   intended for storing compiler-defined wide characters, which may be Unicode
   characters in some compilers. However, some programmers can use a macro or
   typedef (for example, UNICHAR) that can be compiled as unsigned short or
   wchar_t depending on the target compiler and platform. This choice enables
   correct compilation on different platforms and compilers. Where a 16-bit
   implementation of wchar_t is guaranteed, such macros or typedefs may be
   predefined (for example, WCHAR on Win32 API).

Change the sentence that begins "However, some programmers..." as follows:
"However, programmers who want a UTF-16 implementation can use a macro
or typedef (for example, UNICHAR) that might be compiled as unsigned
short or wchar_t, depending on the target compiler and platform. Other
programmers who want a UTF-32 implementation can use a macro or
typedef that might be compiled as unsigned int or wchar_t, depending on
the target compiler and platform."
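
For illustration only, here is a minimal sketch of how such typedefs might
be selected. UNICHAR is the example name from the text; UNICHAR32 and
to_unichar are hypothetical names, and the sketch assumes a C99 compiler
whose WCHAR_MAX can be tested in #if.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* WCHAR_MAX (C99) */
#include <wchar.h>    /* wchar_t */

/* UTF-16 code unit: use wchar_t only when its maximum value is 0xFFFF
   (an unsigned 16-bit wchar_t); otherwise fall back to unsigned short. */
#if WCHAR_MAX == 0xFFFF
typedef wchar_t UNICHAR;
#else
typedef unsigned short UNICHAR;
#endif

/* UTF-32 code unit: use wchar_t when it can hold U+10FFFF.
   UNICHAR32 is a hypothetical name; unsigned int is assumed to be
   at least 32 bits wide on the fallback path. */
#if WCHAR_MAX >= 0x10FFFF
typedef wchar_t UNICHAR32;
#else
typedef unsigned int UNICHAR32;
#endif

/* Compile-time checks: a negative array size stops the build. */
typedef char unichar_has_16_bits[(sizeof(UNICHAR) * CHAR_BIT >= 16) ? 1 : -1];
typedef char unichar32_has_21_bits[(sizeof(UNICHAR32) * CHAR_BIT >= 21) ? 1 : -1];

/* Narrow a value to one UTF-16 code unit; the range is checked
   only when assertions are enabled. */
UNICHAR to_unichar(unsigned long v)
{
    assert(v <= 0xFFFFul);
    return (UNICHAR)v;
}
```

The compile-time checks are there because the widths of unsigned short
and unsigned int are themselves compiler-specific, like the width of
wchar_t.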
   
   On systems where the native character type or wchar_t is implemented
   as a 32-bit quantity, an implementation may transiently use 32-bit
   quantities to represent Unicode characters during processing. The internal
   workings of this representation are treated as a black box and are not
   Unicode-conformant. In particular, any API or runtime library interfaces
   that accept strings of 32-bit characters are not Unicode-conformant. If
   such an implementation interchanges 16-bit Unicode characters with the
   outside world, then this interchange can be conformant as long as the
   interface for this interchange complies with the requirements of Chapter 3,
   Conformance.

Revise this as follows: "On systems where the native character type or
wchar_t is implemented as a 32-bit quantity, an implementation may use
the UTF-32 form to represent Unicode characters." 
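
As an illustration of what this permits, the following sketch converts
UTF-16 input into UTF-32 code units for internal processing. The function
name utf16_to_utf32 is hypothetical, and replacing unpaired surrogates
with U+FFFD is a choice made for this sketch, not something the proposed
wording requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode n UTF-16 code units from src into UTF-32 code units in dst.
   dst must have room for n elements, since each input unit yields at
   most one output unit. Unpaired surrogates become U+FFFD.
   Returns the number of UTF-32 code units written. */
size_t utf16_to_utf32(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t u = src[i++];

        if (u >= 0xD800 && u <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* High surrogate followed by low surrogate: combine them. */
            uint32_t lo = src[i++];
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            /* Surrogate without a partner. */
            u = 0xFFFD;
        }
        dst[out++] = u;
    }
    return out;
}
```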


These changes also imply a change in Section 3.1 Conformance Requirements
(beginning pg. 37). Item C1 currently says:

   "A process shall interpret the Unicode code values as 16-bit quantities.
   
   * Unicode values can be stored in native 16-bit machine words.
   
   * For information on the use of wchar_t or other programming language
   types to represent Unicode values, see Section 5.2, ANSI/ISO C wchar_t."

Change this to:

"Unicode defines scalar values that can be encoded in any of three
transformation formats: UTF-8, UTF-16, or UTF-32."

Remove the bullet item that begins "Unicode values..." The last bullet
item can remain as is.
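
To make the three formats in the proposed C1 wording concrete, here is a
minimal sketch that encodes one scalar value as UTF-8 and as UTF-16; in
UTF-32 the scalar value is itself the single code unit. The function
names are hypothetical, and the sketch assumes scalar values run from 0
to 0x10FFFF excluding the surrogate range D800..DFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if cp is a Unicode scalar value. */
int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encode cp as UTF-8. Returns the number of bytes written (1 to 4),
   or 0 if cp is not a scalar value. */
size_t encode_utf8(uint32_t cp, uint8_t out[4])
{
    if (!is_scalar_value(cp))
        return 0;
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Encode cp as UTF-16. Returns the number of 16-bit code units
   written (1 or 2), or 0 if cp is not a scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (!is_scalar_value(cp))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}
```

For example, U+20AC takes three bytes in UTF-8 (E2 82 AC) and one code
unit in UTF-16, while U+1D11E takes four bytes in UTF-8 (F0 9D 84 9E)
and a surrogate pair (D834 DD1E) in UTF-16.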