L2/01-058

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Wednesday, January 24, 2001 2:08 PM
Subject: Draft changes for wchar_t and Conformance sections

I noted an omission in PDUTR #27 last week with respect to the existing wording about wchar_t. Ken invited me to submit proposed changes. They are attached. Comments welcome.

-- Sandra

-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com

***********************************************************************

In Unicode R3.0:

Section 5.2 ANSI/ISO C wchar_t (pp 107-108)

With the wchar_t wide character type, ANSI/ISO C provides for the inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension. The Unicode characters in the ASCII range U+0020 to U+007E satisfy these conditions. Thus, if an implementation uses ASCII to code the portable C execution set, the use of the Unicode character set for the wchar_t type, with a width of 16 bits, fulfills the requirement.

Change the last sentence to "...the use of the Unicode character set for the wchar_t type, in either UTF-16 or UTF-32 form, fulfills the requirement."

The width of wchar_t is compiler-specific and can be as little as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers. However, some programmers can use a macro or typedef (for example, UNICHAR) that can be compiled as unsigned short or wchar_t depending on the target compiler and platform. This choice enables correct compilation on different platforms and compilers.
Where a 16-bit implementation of wchar_t is guaranteed, such macros or typedefs may be predefined (for example, WCHAR on Win32 API).

Change the sentence that begins "However, some programmers..." as follows:

"However, programmers who want a UTF-16 implementation can use a macro or typedef (for example, UNICHAR) that might be compiled as unsigned short or wchar_t, depending on the target compiler and platform. Other programmers who want a UTF-32 implementation can use a macro or typedef that might be compiled as unsigned int or wchar_t, depending on the target compiler and platform."

On systems where the native character type or wchar_t is implemented as a 32-bit quantity, an implementation may transiently use 32-bit quantities to represent Unicode characters during processing. The internal workings of this representation are treated as a black box and are not Unicode-conformant. In particular, any API or runtime library interfaces that accept strings of 32-bit characters are not Unicode-conformant. If such an implementation interchanges 16-bit Unicode characters with the outside world, then this interchange can be conformant as long as the interface for this interchange complies with the requirements of Chapter 3, Conformance.

Revise this as follows:

"On systems where the native character type or wchar_t is implemented as a 32-bit quantity, an implementation may use the UTF-32 form to represent Unicode characters."

These changes also imply a change in Section 3.1 Conformance Requirements (beginning pg. 37). Item C1 currently says:

"A process shall interpret the Unicode code values as 16-bit quantities.
* Unicode values can be stored in native 16-bit machine words.
* For information on the use of wchar_t or other programming language types to represent Unicode values, see Section 5.2, ANSI/ISO C wchar_t."

Change this to:

"Unicode defines scalar values that can be encoded in any of three transformation formats: UTF-8, UTF-16, or UTF-32."
Remove the bullet item "Unicode values..." The last bullet item can remain as is.
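
For illustration only, here is a minimal C sketch of the typedef approach described in the proposed Section 5.2 wording. It is not part of the proposed text. The names UNICHAR16, UNICHAR32, encode_utf16, and encode_utf32 are hypothetical, and the sketch assumes a C99 compiler whose headers define WCHAR_MAX as a constant usable in #if. It selects wchar_t where its range fits the encoding form, falls back to a fixed-width integer type otherwise, and shows one scalar value encoded in UTF-16 and UTF-32.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical typedef names, chosen for this sketch only. */

/* UTF-16 code unit: use wchar_t only when it looks like an unsigned
   16-bit type; otherwise fall back to an integer of at least 16 bits. */
#if WCHAR_MAX >= 0xFFFF && WCHAR_MAX < 0x10FFFF
typedef wchar_t UNICHAR16;
#else
typedef uint_least16_t UNICHAR16;
#endif

/* UTF-32 code unit: use wchar_t only when it can hold U+10FFFF. */
#if WCHAR_MAX >= 0x10FFFF
typedef wchar_t UNICHAR32;
#else
typedef uint_least32_t UNICHAR32;
#endif

/* Returns 1 if the value is a Unicode scalar value
   (0..0x10FFFF, excluding the surrogate range D800..DFFF). */
int is_scalar_value(uint_least32_t v)
{
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

/* Encodes one scalar value as UTF-16.
   Returns the number of code units written (1 or 2), or 0 if invalid. */
size_t encode_utf16(uint_least32_t scalar, UNICHAR16 out[2])
{
    if (!is_scalar_value(scalar))
        return 0;
    if (scalar < 0x10000) {
        out[0] = (UNICHAR16)scalar;
        return 1;
    }
    scalar -= 0x10000;                              /* 20 bits remain */
    out[0] = (UNICHAR16)(0xD800 + (scalar >> 10));  /* high surrogate */
    out[1] = (UNICHAR16)(0xDC00 + (scalar & 0x3FF)); /* low surrogate */
    return 2;
}

/* Encodes one scalar value as UTF-32: the code unit is the scalar value.
   Returns 1 on success, 0 if the value is not a scalar value. */
size_t encode_utf32(uint_least32_t scalar, UNICHAR32 out[1])
{
    if (!is_scalar_value(scalar))
        return 0;
    out[0] = (UNICHAR32)scalar;
    return 1;
}
```

With these definitions, U+0041 occupies one code unit in either form, while U+10400 becomes the surrogate pair D801 DC00 in UTF-16 and the single code unit 00010400 in UTF-32. On a platform where wchar_t is 32 bits wide, UNICHAR16 falls back to the integer type and UNICHAR32 resolves to wchar_t.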