Re: What's in a wchar_t string on unix?

From: Frank Yung-Fong Tang (
Date: Tue Mar 02 2004 - 11:54:47 EST


    Antoine Leca wrote on 3/2/2004, 5:50 AM:

    > Rick Cameron asked:
    > > If the locale is set to be Unicode,
    > That part is highly suspect.
    > Since you write that, you already know the wchar_t encoding (as well
    > as char one) depends on the locale setting.

    No, not true. The wchar_t encoding depends on the COMPILER and C library
    implementation, not on the locale setting.

    For example, wchar_t on MS Windows is defined by Microsoft (again, MS is
    the one who defines the compiler and C library on that platform) as UCS-2.
    On Windows, wchar_t always holds UCS-2 regardless of what locale you set,
    but only because whoever designed the compiler and C library defined it so.

    Likewise, in GNU's gcc and libc implementation, wchar_t is defined to be 4
    bytes and always holds UTF-32 regardless of which locale you set. But
    again, that was defined by whoever wrote gcc and the GNU version of libc.

    It is compiler and C library implementation dependent.
    It is NOT locale dependent (unless a particular C library implementation
    defines it so).
    It is NOT application implementation dependent.
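
    A minimal sketch of what "compiler and C library dependent" means in
    practice: the width and range of wchar_t are fixed when the program is
    compiled, before any locale is selected. The function names below are
    hypothetical, chosen only for illustration. Note that these values
    describe the size of the type, not the encoding stored in it.

```c
#include <stdio.h>
#include <wchar.h>   /* wchar_t, WCHAR_MAX */

/* Hypothetical helper: number of bytes this compiler gives wchar_t. */
size_t wchar_width_bytes(void)
{
    return sizeof(wchar_t);
}

/* Hypothetical helper: nonzero if one wchar_t is wide enough to hold any
   code point up to U+10FFFF. This says nothing about whether the
   implementation actually stores Unicode values in it. */
int wchar_can_hold_all_code_points(void)
{
    return (unsigned long)WCHAR_MAX >= 0x10FFFFUL;
}

/* Print both properties; neither changes when the locale changes. */
void print_wchar_properties(FILE *out)
{
    fprintf(out, "sizeof(wchar_t) = %lu bytes\n",
            (unsigned long)wchar_width_bytes());
    fprintf(out, "holds U+10FFFF  = %s\n",
            wchar_can_hold_all_code_points() ? "yes" : "no");
}
```

    Calling print_wchar_properties(stdout) from main() before and after
    setlocale() should give the same two lines on a given compiler.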

    > Few person has this right. So you then
    > also know that "wchar_t is implementation defined" in all the relevant
    > standards (ANSI, C99, POSIX, SUS). In other words, this says, answer
    > is in the documentation for YOUR implementation.

    Be careful here. The so-called implementation in those standards refers to
    the implementation of the C compiler and C library code. It does not refer
    to the application implementation.

    > Now, we can try to guess. But there are only guesses.
    > > what's in a wchar_t string? Is it UTF-32, or UTF-16 with the code units
    > > zero-extended to 4 bytes?
    > The later is an heresy. Nobody should be fool enough to have this. UCS-2
    > with the code units zero-extended to 4 bytes might be an option, but if a
    > implementor has support for UTF-16, why would she store extended
    > UTF-16 (in whatever form, i.e. split or joined, 4 or 8 bytes) in
    > wchar_t? Any evidence of this would be a severe bug, IMHO.

    Again, the C library and C compiler implementation (again, not the
    application implementation) are free to choose what they do. They may
    choose even the least likely design, so you won't be able to guess it
    right.
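
    One way to act on this, sketched under the assumption that the program
    only needs the text in the locale's multibyte encoding: never inspect
    the numeric values, and let the C library convert with the standard
    wcstombs(). The wrapper name wide_to_multibyte is hypothetical.

```c
#include <stdlib.h>  /* wcstombs, wchar_t, size_t */

/* Hypothetical wrapper: convert a wide string to the current locale's
   multibyte encoding without assuming what the wchar_t values mean.
   Returns the number of bytes written, not counting the terminating
   '\0', or (size_t)-1 if out_size is 0 or a wide character cannot be
   represented. Output is silently truncated if the buffer is too small. */
size_t wide_to_multibyte(const wchar_t *wide, char *out, size_t out_size)
{
    size_t n;

    if (out_size == 0)
        return (size_t)-1;

    /* Leave room for the terminator; wcstombs omits it when it runs
       out of space. */
    n = wcstombs(out, wide, out_size - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = '\0';
    return n;
}
```

    The caller sees only char data in a documented encoding (whatever the
    locale says), so the wchar_t representation stays opaque.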

    > Back to your original question, and assuming "the locale is set to be
    > Unicode", there is as much possibility to encounter UTF-32 values (which
    > would mean the implementation does have Unicode 3.1 support) than
    > zero-extended UCS-2 (case of a pre-3.1 Unicode implementation). Other
    > values would be very strange, IMHO.

    Not strange at all if the developer of the C library and C compiler
    implementation intentionally wants to make it opaque so no one can easily
    find out the answer and do the wrong thing. Of course, eventually
    people can still find it out. If I implemented one today, I would probably
    do a UTF-32 xor with 0x1BADBEEF (hmm... that may not work, since I may
    need to make sure ASCII 0x00 - 0x7f maps to 0x00000000 - 0x0000007f. I
    think [not 100% sure] that is mandated by ANSI C for wchar_t).
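
    A small sketch of that thought experiment, assuming the constraint is
    as recalled above (ASCII values must be stored unchanged). The names
    OPAQUE_KEY, opaque_encode, opaque_decode and ascii_is_preserved are
    hypothetical; no real C library is known to do this.

```c
#include <stdint.h>

/* The constant floated in the message. */
#define OPAQUE_KEY UINT32_C(0x1BADBEEF)

/* Hypothetical opaque encoding: the UTF-32 value XORed with a fixed key. */
uint32_t opaque_encode(uint32_t code_point)
{
    return code_point ^ OPAQUE_KEY;
}

/* XOR with the same key undoes the encoding. */
uint32_t opaque_decode(uint32_t stored)
{
    return stored ^ OPAQUE_KEY;
}

/* Returns 1 if every value 0x00..0x7F is stored unchanged, else 0. */
int ascii_is_preserved(void)
{
    uint32_t c;

    for (c = 0; c <= 0x7F; ++c) {
        if (opaque_encode(c) != c)
            return 0;
    }
    return 1;
}
```

    Here ascii_is_preserved() returns 0, since for example 'A' (0x41) is
    stored as 0x1BADBEAE. So the plain XOR does fail the ASCII constraint,
    as suspected.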

    > Recent standards has a test feature macro, __STDC_ISO_10646__, that if
    > defined will tell you the answer: defined to be greater than 1999xxL will
    > mean UTF-32 values. Defined but less than 1999xxL will probably mean no
    > surrogate support, hence zero-extended UCS-2. Undefined does not tell you
    > anything.
    > Unfortunately, this is also the most current setup.

    As soon as you start to guess the values in wchar_t, you are on the wrong
    path for ANSI C wchar_t.
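
    For completeness, a sketch of reading the __STDC_ISO_10646__ macro
    mentioned in the quoted text instead of guessing. The function names are
    hypothetical. When the macro is undefined the code reports 0, meaning
    "the implementation makes no promise", not "not Unicode".

```c
#include <wchar.h>  /* pulls in the implementation's feature macros */

/* Hypothetical helper: the yyyymm value of __STDC_ISO_10646__, or 0
   when the implementation does not define the macro. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return (long)__STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Nonzero only when the implementation itself documents that wchar_t
   values are ISO 10646 code points. */
int wchar_is_documented_unicode(void)
{
    return iso10646_version() != 0;
}
```

    This still asks the compiler and C library, not the locale and not the
    application, which is the point being made above.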

    I don't like how ANSI C defines wchar_t, and there is definitely
    a need for a data type which holds a wide char and also lets us know
    what the value means, but for sure that data type is not wchar_t. It is
    wchar_t on Win32 only because MS adds an additional definition to it.
