From: Karlsson Kent - keka [keka@im.se]
Sent: Friday, January 26, 2001 2:12 PM

Subject: Draft changes for wchar_t and Conformance sections

[Suggested new text at the end of this message]


> -----Original Message-----
> From: Sandra O'donnell USG [mailto:odonnell@zk3.dec.com]
> These are all optional types in C99, and I think many people

No, they are not quite that.  C99 says for <stdint.h> (clause 7.18.1.1): "if an
implementation provides integer types with widths of 8, 16, 32, or 64 bits,
it **shall** define the corresponding typedef names." (my emphasis).  And
the typedef names with the word "least" in the names are **required** for
8, 16, 32, and 64 bit least widths (clause 7.18.1.2), irrespective of
architecture (though you may then (very rarely) find 9 or more bits in a
uint_least8_t value representation).
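
For example (a minimal sketch; the "least" types are guaranteed by
C99 on every implementation, while the exact-width ones are
conditional):

    #include <stdint.h>

    uint_least8_t  lu8;   /* required by C99; at least 8 bits wide  */
    uint_least16_t lu16;  /* required by C99; at least 16 bits wide */
    #ifdef UINT8_MAX      /* defined only when an exact 8-bit type exists */
    uint8_t        xu8;   /* exactly 8 bits, no padding bits */
    #endif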

> (including
> me!) would object to recommending such encoding-specific types. The
> code I have that currently uses char and cleanly handles
> UTF-8, Latin-1,
> eucJP, etc., would have to be revised to special-case UTF-8 using the
> uint8_t type, for example.

Though "char" (or better "unsigned char") can handle multiple encodings,
by locale, still:

1. There is special-casing: if the locale is a "UTF-8 locale", some
   special functions can be used (like Markus Kuhn's wcwidth[_cjk]())
   which work only for UTF-8 (see the sketch after this list).

2. Even though some things work with "agnostic" datatypes like char
   and wchar_t, not everything in a more ambitious Unicode
   implementation can be controlled from POSIX locale data.  That is
   why some nail wchar_t to UCS-2/UTF-16, and others have other names
   for Unicode-specific datatypes.

3. There is one such recommendation there already: UNICHAR; it would
   be more neutral to have UTF8_t, UTF16_t, and UTF32_t.  I don't
   think it would be a good idea to have a UNICHAR which could sometimes
   be UTF-16, sometimes UTF-32, since those encodings need to be treated
   differently.
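
A sketch of the special-casing in point 1, assuming a system that
provides wcwidth() (as in Markus Kuhn's implementation) and a UTF-8
locale named "en_US.UTF-8":

    #define _XOPEN_SOURCE 500  /* for wcwidth() on XSI systems */
    #include <locale.h>
    #include <wchar.h>
    #include <stdio.h>

    int main(void)
    {
        /* wcwidth() gives sensible answers only when wchar_t values
           are Unicode code points, e.g. in a UTF-8 locale. */
        if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
            return 1;
        printf("%d\n", wcwidth(L'A'));       /* 1: narrow letter      */
        printf("%d\n", wcwidth(L'\u4E2D'));  /* 2: wide CJK character */
        return 0;
    }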


> I know we have differing opinions about the desirability of
> encoding-specific
> types, but I think this is waaaaay too controversial to put
> into a small
> section of the Unicode standard that's trying to describe wchar_t.
>    . . .
>          There is only one suggested "typedefed" name: UNICHAR,
>    for UTF-16 *code units*, not *characters*.  Suggestion:
>    typedef uint8_t UTF8_t;
>    typedef uint16_t UTF16_t;
>    typedef uint32_t UTF32_t;
>    (I'm not really suggesting to write out those typedefs as
>    C code in the text, even though I did so here.)
> Again, I disagree. However, here I was trying simply to use
> as much of the
> existing text as possible. It refers to UNICHAR, so my
> revision did as well.
>          I'm not sure wchar_t can be used for UTF-16 and still
>    fully conform to C99.  I have very much the impression that
>    wchar_t may be UCS-2, or UCS-4, or whatever else can hold
>    any character in the coded character set in a *single* code
>    unit, but not UTF-8 code units, nor UTF-16 code units.
>    (I cannot quote the C standard, since I don't have my
>    (only, paper, FDIS) copy handy).
> I almost mentioned that in my original message, but thought it might
> bring the wrath of Microsoft/IBM down upon me. I agree that UTF-16 is
> not useable as a wchar_t encoding because wchar_t is a FIXED WIDTH
> encoding and UTF-16 clearly is not. UCS-2 is okay in wchar_t,
> of course.
> It's just that I know that it's a tricky subject for the vendors
> who chose 16-bits as their wchar_t. I mean, if wchar_t is 16 bits
> and wc happens to contain the first half of a surrogate pair, what
> are functions like iswalpha(), wcwidth(), wcstombs(), towupper(),
> etc., supposed to do?

Well, the C (and COBOL?) restrictions are broken anyway, as Mark says.
to[w]upper('ß') should return "SS", but it can't.  More examples in
SpecialCasing.txt.  I see no problem in MS having wchar_t be UTF-16
code units, even though that strictly speaking does not follow the C99
requirement that a wchar_t value represent any supported character in
a single code unit.
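
A sketch of the breakage, assuming a UTF-8 locale in which wchar_t
values are Unicode code points:

    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>
    #include <stdio.h>

    int main(void)
    {
        setlocale(LC_ALL, "en_US.UTF-8");  /* assumed locale name */
        /* towupper() maps one wide character to one wide character,
           so it cannot produce the two-letter "SS" that full Unicode
           casing requires for U+00DF LATIN SMALL LETTER SHARP S;
           typically it just returns U+00DF unchanged. */
        wint_t up = towupper(L'\u00DF');
        printf("towupper(U+00DF) = U+%04X\n", (unsigned)up);
        return 0;
    }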

> This may be worth discussing, but I don't want to lose sight of the
> existing inaccuracy in R3.0 that says it is NOT Unicode-conformant to
> use a 32-bit wchar_t. I want to make sure that gets fixed.

I don't think it is all that good an idea to use UTF-32 for strings,
though it is useful for isolated characters and for interrogation
functions on isolated characters.  Even for toupper on an isolated
character, it still does not work properly, since the full case
mappings can be one-to-many.
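
For instance (the function names here are purely illustrative, not an
existing API): property interrogation fits a single 32-bit code
point, but full case mapping needs a string-valued result:

    #include <stddef.h>
    #include <stdint.h>

    /* Fine as a per-code-point interrogation function. */
    int cp_is_alphabetic(uint_least32_t cp);

    /* Full Unicode uppercasing can be one-to-many (U+00DF -> "SS"),
       so the result must be a sequence, not a single code point. */
    size_t cp_to_upper_full(uint_least32_t cp,
                            uint_least32_t out[], size_t out_size);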

> I'm just not
> sure we can have a productive discussion about UTF-16 as
> wchar_t; it tends
> to cause a lot of heat.
>               Regards,
>               -- Sandra
> -----------------------
> Sandra Martin O'Donnell
> Compaq Computer Corporation
> sandra.odonnell@compaq.com
> odonnell@zk3.dec.com

[The following is VERY brief; but I don't want to make it any longer
than the current 5.2. I've deleted a lot of old text that I find
tangential in such a short piece, or that I find to be moot.  I've
added some text about 'char' since that is what Linux folks prefer
for UTF-8, as well as more about typedefs and conditional
compilation.  I hope I have covered what Sandra wanted to cover,
even though I have made some rather thorough changes to Sandra's
suggested text.]

Suggested new text for 5.2:

5.2 Datatypes for Unicode

Unicode code units (which singly or in sequence represent
a Unicode character) need to be represented in some datatype
in programming languages.  Some programming languages may
also have predefined types (or classes) for Unicode strings.


___Java___

The datatype 'char' in Java is for
representing UTF-16 code units, though initially only for
the BMP.  The datatype 'int' is sometimes used to represent
a UTF-16 code unit. [This may be a UTF-32 code unit later on...;
I don't know what the plans are for Java.]

A 'char' array, 'char[]', can be used to represent a null-terminated
UTF-16 string.

The Java class String can also represent a Unicode string.


___C99 (ISO/IEC 9899:1999) and C++ (ISO/IEC 14882:1998)___

The datatype 'unsigned char' (or, less stringently, 'char')
can be used for various byte-oriented character encodings,
including multibyte character encodings like UTF-8.  However,
functions such as 'isalpha' work only where a character is
represented in a single code unit, and functions like 'toupper'
in addition give a proper result only if the result, too, is a
single character that fits in a single byte.  The datatype
'int' is sometimes used to represent a value of type 'char'.
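
For illustration (the bytes are the UTF-8 encoding of U+00E9 LATIN
SMALL LETTER E WITH ACUTE; the results assume a UTF-8 locale):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        setlocale(LC_ALL, "en_US.UTF-8");  /* assumed locale name */
        /* isalpha() classifies one byte at a time, so neither byte
           of the two-byte sequence 0xC3 0xA9 is, by itself, a letter. */
        unsigned char e_acute[] = { 0xC3, 0xA9 };
        printf("%d %d\n", isalpha(e_acute[0]), isalpha(e_acute[1]));
        return 0;
    }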

The datatype 'wchar_t' (in '<wchar.h>') can be used for various "wide"
character encodings; C and C++ leave the semantics of 'wchar_t' to the
specific implementation. 'wchar_t' may be for Unicode in some compilers,
e.g. for UTF-16 or for UTF-32.  The width of 'wchar_t' is compiler-specific
and can be as little as 8 bits, and even if wider it need not be Unicode.
Consequently, programs that need to be portable across any C or C++
compiler should not use 'wchar_t' for storing Unicode text. The datatype
'wint_t' is sometimes used to represent a value of type 'wchar_t'.

However, programmers can use (one or more) typedefs for Unicode
code units.  E.g., one can define 'UTF8_t' to be 'uint8_t', 'UTF16_t'
to be 'uint16_t', and 'UTF32_t' to be 'uint32_t'.  The last one is
particularly useful for single code point property interrogation
functions.  'uint[N]_t' for N being 8, 16, 32, or 64 are defined in
'<stdint.h>' for C99 on all computer architectures that natively have
those data widths.  Alternatively, one can use the 'uint_least[N]_t' or
'uint_fast[N]_t' datatypes, which are provided in all C99 implementations.
Further, programmers can use conditional compilation to choose between
different 'typedef's for the same Unicode code unit name depending
on platform.
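
A sketch of such typedefs and conditional compilation (the platform
test, and the assumption that wchar_t is UTF-16 code units there, are
illustrative only):

    #include <stddef.h>   /* for wchar_t */
    #include <stdint.h>

    typedef uint_least8_t  UTF8_t;
    typedef uint_least32_t UTF32_t;

    #if defined(_WIN32)
    /* Assumed: this compiler's wchar_t holds UTF-16 code units. */
    typedef wchar_t        UTF16_t;
    #else
    typedef uint_least16_t UTF16_t;
    #endif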


I agree with Sandra on the suggested changes to the conformance sections.

                /kent k