RE: What's in a wchar_t string ...

From: Winkler, Arnold F (Arnold.Winkler@unisys.com)
Date: Thu Mar 04 2004 - 08:21:46 EST


    Folks,

    Since "ISO/IEC 9899 - Programming Language C" was quoted, I wonder if
    you are aware of the efforts of SC22/WG14 to develop a Technical Report
    that deals with the problems discussed in this thread.

    The document is ISO/IEC DTR 19769, "Extensions for the programming
    language C to support new character data types".

    The project is currently in DTR ballot and, once approved, will
    certainly take some time to be implemented in C compilers and
    operating systems. But it gives a good indication of the direction
    formal standardization is taking for character data types in C.

    Here are some excerpts from the DTR 19769:

    Quote:
    3 The new typedefs
    This Technical Report introduces the following two new typedefs,
    char16_t and char32_t:
    typedef T1 char16_t;
    typedef T2 char32_t;
    where T1 has the same type as uint_least16_t and T2 has the same type as
    uint_least32_t.
    The new typedefs guarantee certain widths for the data types, whereas
    the width of wchar_t is implementation defined. The data values are
    unsigned, while char and wchar_t could take signed values.
    This Technical Report also introduces the new header:
    <uchar.h>
    The new typedefs, char16_t and char32_t, are defined in <uchar.h>

    4 Encoding
    C99 subclause 6.10.8 specifies that the value of the macro
    __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL
    (for example, 199712L), intended to indicate that values of type
    wchar_t are the coded representations of the characters defined by
    ISO/IEC 10646, along with all amendments and technical corrigenda as
    of the specified year and month." C99 subclause 6.4.5p5 specifies
    that wide string literals are initialized with a sequence of wide
    characters as defined by the mbstowcs function with an
    implementation-defined current locale. Analogous to this macro, this
    Technical Report introduces two new macros.

    If the header <uchar.h> defines the macro __STDC_UTF_16__, values of
    type char16_t shall have UTF-16 encoding. This allows the use of
    UTF-16 in char16_t even when wchar_t uses a non-Unicode encoding. In
    certain cases the compile-time conversion to UTF-16 may be restricted
    to members of the basic character set and universal character names
    (\Unnnnnnnn and \unnnn) because for these the conversion to UTF-16 is
    defined unambiguously.

    If the header <uchar.h> defines the macro __STDC_UTF_32__, values of
    type char32_t shall have UTF-32 encoding.

    If the header <uchar.h> does not define the macro __STDC_UTF_16__, the
    encoding of char16_t is implementation defined. Similarly, if the header
    <uchar.h> does not define the macro __STDC_UTF_32__, the encoding of
    char32_t is implementation defined.

    An implementation may define other macros to indicate a different
    encoding.
    Unquote

    The document, which of course is copyrighted by ISO, starts with a nice
    introduction that defines the problem. In addition to the excerpts
    above, it also addresses the following subjects:
    5 String literals and character constants
    5.1 String literals and character constants notations
    5.2 The string concatenation
    6 Library functions
    6.1 The mbrtoc16 function
    6.2 The c16rtomb function
    6.3 The mbrtoc32 function
    6.4 The c32rtomb function
    7 ANNEX A Unicode encoding forms: UTF-16, UTF-32

    Best regards
    Arnold

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    Behalf Of Nelson H. F. Beebe
    Sent: Wednesday, March 03, 2004 1:49 PM
    To: unicode@unicode.org
    Cc: beebe@math.utah.edu
    Subject: Re: What's in a wchar_t string ...

    "Frank Yung-Fong Tang" <ytang0648@aol.com> asks on Wed, 3 Mar 2004
    12:38:49 -0500:

    >> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is
    >> defined?
    >> or does it only mean wchar_t hold the character in ISO_10646
    >> (which mean it could be 2 bytes, 4 bytes or more than that?)

    Here is the exact text from

            INTERNATIONAL ISO/IEC STANDARD 9899
            Second edition
            1999-12-01
            Programming languages -- C

    >> ...
    >> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    >> example, 199712L), intended to indicate
    >> that values of type wchar_t are the coded
    >> representations of the characters defined
    >> by ISO/IEC 10646, along with all amendments
    >> and technical corrigenda as of the
    >> specified year and month.
    >> ...

    It says nothing more about the size of wchar_t, or what encodings are
    used: note the vague language "coded representations...". This means
    effectively that the implementation, not the Standard, decides.

    Very few current Unix C or C++ compilers even define the symbol
    __STDC_ISO_10646__; the C/C++ feature test package at

            ftp://ftp.math.utah.edu/pub/features
            http://www.math.utah.edu/pub/features

    probes that macro value, and many others.

    My logs of its runs in about 90 build environments show definitions
    with values 200009 for GNU gcc versions 3.x (all platforms), Intel icc
    versions 7.x and 8.0 (Intel IA-32 and IA-64), and Portland Group pgcc
    versions 4.x and 5.x (Intel IA-32). On all of these, it reports that
    sizeof(wchar_t) = 4, but of course, that says nothing whatever about
    the encoding.

    -------------------------------------------------------------------------------
    - Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
    - University of Utah                    FAX: +1 801 581 4148                  -
    - Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
    - 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
    - Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe  -
    -------------------------------------------------------------------------------


