L2/02-427

WG14 Document: N992
Date: 2002-11-15


Additional Character Data Types in the Programming Language C

The WG14 meeting in April 2002 discussed document N969, and the meeting in October 2002 discussed document N977. During those discussions, the following basic criteria were considered important for framing further discussion of additional character data types:

  1. WG14 has a request from US, Germany, UTC, and SC22 to consider UTF-16 and UTF-32 support.  (N966)
  2. It is essential, for the sake of portability, that the new data types guarantee a certain width.
  3. It is desirable that additional character data types are as generic as possible.  Although we currently have requests for UTF-16 and UTF-32 support, the new data types must cover, in principle, other encodings.
  4. String literals need to be specified for the new data types. 
  5. It is desirable that the encoding of the new data types is implementation independent.

There is a consensus to call the new data types char16_t and char32_t.  The names suggest that the widths of the new data types are well defined; the encoding of those data types is implementation-defined.

1. How to specify the string literals for new data types

1.1 Simple approach with a prefix for literals

A one-letter prefix is used, similar to the notation L"str" for wide string literals:

  u"str"

The literal is used to initialize an array of char16_t. The corresponding character constants are

  u'c'

and have the type char16_t.

This proposal covers a 32-bit type in the same way, using char32_t, U"str" and U'c'.
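The following is a minimal sketch of the proposed notation, not normative text. It assumes a compiler that implements the form in which this proposal was later adopted (C11), where char16_t and char32_t are declared in the header <uchar.h>; that header is not named in this document. The identifiers str16, str32, chr16, chr32, length16 and length32 are hypothetical and exist only for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */
#include <uchar.h>    /* char16_t, char32_t (assumption: C11 header) */

/* Criterion 2: each new type guarantees a minimum width. */
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t is at least 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t is at least 32 bits");

/* String literals with the proposed prefixes initialize arrays. */
static const char16_t str16[] = u"str";
static const char32_t str32[] = U"str";

/* Character constants with the proposed prefixes. */
static const char16_t chr16 = u'c';
static const char32_t chr32 = U'c';

/* Hypothetical helper: count elements before the terminating zero. */
static size_t length16(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Same as length16, for the 32-bit type. */
static size_t length32(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

As with L"str", each array holds one element per character plus a terminating zero, so sizeof str16 / sizeof str16[0] is 4 for the three-character literal.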

2. How to specify the Encoding of new data types

C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.

There shall be a macro __STDC_UTF_16__ (or similar) to indicate that char16_t uses UTF-16. This also allows the use of UTF-16 in char16_t even if wchar_t uses a non-Unicode encoding. In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\Unnnnnnnn and \unnnn) because for these the conversion to UTF-16 is defined unambiguously. 

The encoding of char32_t can be defined in the same manner using __STDC_UTF_32__. 

The encoding of the new data types and of their string literals is implementation-defined when the corresponding macro __STDC_UTF_nn__ is not defined.
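A program could test the proposed macros as sketched below. This is an illustration under assumptions: the macro spellings __STDC_UTF_16__ and __STDC_UTF_32__ are the ones suggested above ("or similar"), the types are taken from the C11 header <uchar.h>, and all function and variable names are hypothetical. The literal uses a universal character name, the case for which the conversion is described above as unambiguous.

```c
#include <stddef.h>   /* size_t */
#include <uchar.h>    /* char16_t, char32_t (assumption: C11 header) */

/* Returns 1 if the implementation promises UTF-16 in char16_t. */
static int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;   /* encoding is implementation-defined */
#endif
}

/* Returns 1 if the implementation promises UTF-32 in char32_t. */
static int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;   /* encoding is implementation-defined */
#endif
}

/* U+10000, the first character outside the 16-bit range,
   written as a universal character name. */
static const char16_t supplementary16[] = u"\U00010000";
static const char32_t supplementary32[] = U"\U00010000";

/* Number of code units in each literal, excluding the terminator. */
static size_t units16(void)
{
    return sizeof supplementary16 / sizeof supplementary16[0] - 1;
}

static size_t units32(void)
{
    return sizeof supplementary32 / sizeof supplementary32[0] - 1;
}
```

When __STDC_UTF_16__ is defined, U+10000 occupies two char16_t code units (the surrogate pair 0xD800, 0xDC00); when __STDC_UTF_32__ is defined, it occupies a single char32_t with the value 0x10000. Without the macros, a portable program cannot rely on either layout.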

The new string literal formats (u"str" and U"str") should follow the same catenation rules as the existing L"str" strings: adjacent literals of the same format are catenated, and if one of the adjacent literals is a "narrow" string, the result is widened to the representation of the other string literal.  Here are some examples:

  u"a" u"b"  ->  u"ab"        U"a" U"b"  ->  U"ab"        L"a" L"b"  ->  L"ab"

  u"a" "b"   ->  u"ab"        U"a" "b"   ->  U"ab"        L"a" "b"   ->  L"ab"

  "a" u"b"   ->  u"ab"        "a" U"b"   ->  U"ab"        "a" L"b"   ->  L"ab"

Any other catenations are implementation-defined (they might or might not be supported).
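The supported catenations can be checked with a short sketch such as the following. It assumes a compiler that accepts the proposed prefixes and mixed narrow/prefixed catenation (as C11 implementations do) and takes the types from the C11 header <uchar.h>; the array names and the helper same_bytes are hypothetical.

```c
#include <stddef.h>   /* size_t, wchar_t */
#include <string.h>   /* memcmp */
#include <uchar.h>    /* char16_t, char32_t (assumption: C11 header) */

/* Reference arrays written as a single literal. */
static const char16_t c16_whole[]  = u"ab";
static const char32_t c32_whole[]  = U"ab";
static const wchar_t  wide_whole[] = L"ab";

/* Same prefix on both literals. */
static const char16_t c16_both[]  = u"a" u"b";
static const char32_t c32_both[]  = U"a" U"b";
static const wchar_t  wide_both[] = L"a" L"b";

/* Prefixed literal followed by a narrow literal. */
static const char16_t c16_first[]  = u"a" "b";
static const char32_t c32_first[]  = U"a" "b";
static const wchar_t  wide_first[] = L"a" "b";

/* Narrow literal followed by a prefixed literal. */
static const char16_t c16_last[]  = "a" u"b";
static const char32_t c32_last[]  = "a" U"b";
static const wchar_t  wide_last[] = "a" L"b";

/* Hypothetical helper: 1 if both arrays have the same size and contents. */
static int same_bytes(const void *a, size_t na, const void *b, size_t nb)
{
    return na == nb && memcmp(a, b, na) == 0;
}
```

Mixed-prefix forms such as u"a" U"b" are deliberately left out, since they fall under "any other catenations" and might not be supported.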

 

 


Last modified: Wed Nov 13 2002