Re: What's in a wchar_t string on unix?

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Mar 02 2004 - 11:33:05 EST


    Philippe Verdy wrote on 3/1/2004, 4:10 PM:

    > What you'll put or find in wchar_t is application dependant.

    Absolutely not true. wchar_t is COMPILER and C library implementation
    dependent, not "application dependant".
    Why is it COMPILER dependent? Because the compiler has to convert the
    ANSI C syntax L"string" into an array of wchar_t.
    Why is it C library implementation dependent? Because the C library
    implementation needs to know how to handle those wchar_t values inside
    the standard ANSI C mbtowc and mbstowcs routines.

    It is NOT "application dependant"!!!
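
    To see both dependencies on a given platform, a minimal sketch along
    these lines may help. The helper names (wchar_bits, to_wide,
    dump_literal) are made up for illustration; only mbstowcs and the
    L"..." syntax come from the standard. The values printed will differ
    between compilers and C libraries, which is the point.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Number of bits the compiler gives wchar_t on this platform. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Convert a multibyte string with the C library's mbstowcs().
 * Returns the number of wide characters written, or (size_t)-1 on an
 * invalid sequence. The values stored in dst are whatever the library
 * chooses for the current locale. */
static size_t to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    assert(dst != NULL && src != NULL);
    return mbstowcs(dst, src, dst_len);
}

/* Print what the compiler stored for a wide string literal. */
static void dump_literal(void)
{
    const wchar_t *lit = L"string";   /* encoded by the compiler */
    size_t i;

    printf("wchar_t is %lu bits\n", (unsigned long)wchar_bits());
    for (i = 0; lit[i] != L'\0'; i++)
        printf("lit[%lu] = 0x%lx\n",
               (unsigned long)i, (unsigned long)lit[i]);
}
```

    Calling dump_literal() and then to_wide() on the same text lets you
    compare what the compiler produced with what the library produced.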

    > But there's only a guarantee to find a single code unit (not
    > necessarily a codepoint) for characters encoded in the source and
    > compiled with the appropriate source charset. But this charset is not
    > necessarily Unicode.
    > At run-time, functions in the standard libraries that work with or
    > return wide strings only expect these strings to be encoded according
    > to the current locale (not necessarily Unicode).

    How to stuff the locale encoding into a wchar_t is also not necessarily
    straightforward. I once defined an algorithm to stuff 7 planes of CNS
    11643 (two bytes each, in the range 0x2121-0x7e7e) into a 2-byte
    wchar_t (94 x 94 x 7 = 61852 < 2^16 = 65536) while I worked for III on
    UNIX Traditional Chinese support on SVR4. In that case, what was stored
    in wchar_t was neither Unicode nor euc_tw, but a code sequence agreed
    upon between mbtowc and wctomb.
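
    The original SVR4 algorithm is not given here, so the following is
    only a hypothetical packing that satisfies the same arithmetic: it
    maps (plane, byte1, byte2) to a dense index and back. The names
    cns_pack and cns_unpack and the CNS_BASE offset are assumptions for
    illustration; the offset just keeps values 0x00-0xff free for
    single-byte characters and L'\0'.

```c
#include <assert.h>
#include <stdint.h>

#define CNS_PLANES 7
#define CNS_ROWS   94       /* first byte,  0x21..0x7e */
#define CNS_COLS   94       /* second byte, 0x21..0x7e */
#define CNS_BASE   0x0100u  /* hypothetical offset, leaves 0x00..0xff free */

/* Pack plane 1..7 and two bytes in 0x21..0x7e into one 16-bit value.
 * Largest result is 0x0100 + 94*94*7 - 1 = 62107, which fits in 16 bits. */
static uint16_t cns_pack(unsigned plane, unsigned hi, unsigned lo)
{
    assert(plane >= 1 && plane <= CNS_PLANES);
    assert(hi >= 0x21 && hi <= 0x7e);
    assert(lo >= 0x21 && lo <= 0x7e);
    return (uint16_t)(CNS_BASE
        + ((plane - 1) * CNS_ROWS + (hi - 0x21)) * CNS_COLS
        + (lo - 0x21));
}

/* Inverse of cns_pack(): recover plane and both bytes. */
static void cns_unpack(uint16_t w, unsigned *plane, unsigned *hi, unsigned *lo)
{
    unsigned idx;

    assert(w >= CNS_BASE);
    idx = w - CNS_BASE;
    *lo = idx % CNS_COLS + 0x21;
    *hi = (idx / CNS_COLS) % CNS_ROWS + 0x21;
    *plane = idx / (CNS_COLS * CNS_ROWS) + 1;
}
```

    In a scheme like this, mbtowc would call something like cns_pack
    after decoding the euc_tw bytes, and wctomb would call the inverse.
    Nothing outside that pair needs to know the layout.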

    > So if you run your program in an environment where the locale is
    > ISO-8859-2, you'll find code units whose value between 0 and 255 match
    > their position in the ISO-8859-2 standard,

    That may be true for a specific version of a specific implementation,
    but it is not necessarily true for all implementations.
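
    One portable way to ask an implementation what it promises is the
    C99 macro __STDC_ISO_10646__, which is defined only when wchar_t
    values are ISO/IEC 10646 code points in every locale. A small
    sketch (the function name is made up for illustration):

```c
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
 * ISO/IEC 10646 (Unicode) code points, 0 if it makes no such promise.
 * A result of 0 says nothing about what the values actually are. */
static int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

    When this returns 0, code should treat wchar_t values as opaque and
    only pass them back to the same C library that produced them.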

    > but you won't find the corresponding character codepoints as defined
    > by Unicode.
    > A wchar_t can then be used with any charset whose minimum code unit
    > size is lower than or equal to the size of the wchar_t type. This may
    > be an Unicode encoding form, or any other encoding (except UTF-32 if
    > wchar_t is defined as a 16-bit integer type, which is not enough to
    > represent every single Unicode codepoint).

    > wchar_t is then only convenient for Unicode, as it is generally larger
    > than char,

    I 100% disagree with the above statement. In fact, wchar_t was NOT
    originally designed with Unicode in mind at all. It was mainly designed
    to make iterating over text in multibyte character set locales
    (Shift_JIS, euc_jp, euc_tw, gb2312, euc_kr, etc.) easier.
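
    As an illustration of that kind of iteration, here is a minimal
    sketch that counts characters rather than bytes by stepping through
    a multibyte string with mbtowc. The function name mb_char_count is
    an assumption; the result depends on the LC_CTYPE locale in effect.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Count the characters (not bytes) in a multibyte string for the
 * current locale. Returns (size_t)-1 on an invalid sequence. */
static size_t mb_char_count(const char *s)
{
    size_t remaining, count = 0;
    wchar_t wc;
    int n;

    assert(s != NULL);
    remaining = strlen(s);
    (void)mbtowc(NULL, NULL, 0);        /* reset any shift state */
    while (remaining > 0) {
        n = mbtowc(&wc, s, remaining);  /* bytes used by next character */
        if (n <= 0)
            return (size_t)-1;          /* invalid or truncated sequence */
        s += n;
        remaining -= (size_t)n;
        count++;
    }
    return count;
}
```

    After setlocale(LC_CTYPE, "") in a Shift_JIS or EUC locale, the same
    loop advances one character at a time without the caller having to
    know which bytes are lead bytes.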

    > but its presence does not mean it will support UTF-16 or UTF-32 (in
    > ANSI C, wchar_t is allowed to represent the same type as char). [...]
    Same "size" as char, not same "type" as char.

    > Unlike Java's "char" type which is always an unsigned 16-bit integer
    > on all
    > platforms, there's no standard size for wchar_t in C and C++...

    Agree.


