RE: What's in a wchar_t string on unix?

From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Thu Mar 04 2004 - 12:56:31 EST

    Woo-hoo! Finally, a real answer, rather than speculation.

    Thanks very much, Ienup.

    - rick

    -----Original Message-----
    From: Ienup Sung [mailto:is@mpkmail.eng.sun.com]
    Sent: March 4, 2004 9:53
    To: Rick Cameron
    Cc: unicode@unicode.org
    Subject: Re: What's in a wchar_t string on unix?

    Solaris Unicode/UTF-8 locales use UTF-32 for wchar_t, and we guarantee that
    it has been and will stay that way.

    Just in case, there is also a set of standard C APIs such as mbtowc(),
    mbstowcs(), mbrtowc(), wctomb(), wcstombs(), wcrtomb(), and so on that will
    convert between wide characters (UTF-32) and multibyte characters (UTF-8)
    properly as long as you set the current locale to a Unicode/UTF-8 locale. If
    you want a conversion that does not depend on the locale, you can use
    iconv() instead by opening the conversion descriptor with iconv_open() with
    "UTF-32" and "UTF-8" as fromcode and tocode (or vice versa). (A sample
    program is available in the iconv(3C) man page on Solaris, by the way.)
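
As a rough illustration of the locale-dependent route described above, here is
a minimal C sketch around mbstowcs(). The helper names set_utf8_locale() and
utf8_to_wcs(), and the locale names the first helper tries, are assumptions
made for this example; locale names differ between systems, so adjust the list
for yours.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns 1 if one of them could be selected, 0 otherwise. */
int set_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: convert a multibyte string (UTF-8 when a UTF-8
 * locale is current) into out[], which holds out_len wchar_t elements.
 * The result is always null-terminated and is truncated if out[] is too
 * small. Returns the number of wide characters stored, excluding the
 * terminator, or (size_t)-1 on an invalid sequence or a zero-size buffer. */
size_t utf8_to_wcs(const char *utf8, wchar_t *out, size_t out_len)
{
    size_t n;

    if (out_len == 0)
        return (size_t)-1;

    /* Leave room for the terminator: mbstowcs() does not write one
     * when it stops because the element limit was reached. */
    n = mbstowcs(out, utf8, out_len - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = L'\0';
    return n;
}
```

A caller would run set_utf8_locale() once at startup and then pass UTF-8
strings to utf8_to_wcs(). What ends up in each wchar_t is whatever the platform
uses in that locale; the message above states that on Solaris it is UTF-32.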

    I'm also quite sure all major Unix/Linux systems support the functions that
    I mentioned. (I also believe the majority support UTF-32BE, UTF-32LE, and
    similar variants in their iconv() code conversions, by the way.)
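
For the locale-independent route, a minimal sketch using iconv() might look
like the following. The function name utf8_to_utf32le() is an assumption made
for this example. It asks for "UTF-32LE" rather than plain "UTF-32" so that the
byte order is explicit; with plain "UTF-32", some implementations (glibc, for
one) prepend a byte order mark. Whether a given encoding name is accepted
depends on the system's iconv.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a null-terminated UTF-8 string to UTF-32LE
 * bytes in out[]. Returns the number of bytes written, or (size_t)-1 if
 * the conversion is unavailable, the input is invalid, or out[] is too
 * small. No terminator is appended. */
size_t utf8_to_utf32le(const char *utf8, unsigned char *out, size_t out_size)
{
    iconv_t cd;
    char *in = (char *)utf8;      /* iconv() takes char **, input is not modified */
    char *dst = (char *)out;
    size_t in_left = strlen(utf8);
    size_t out_left = out_size;
    size_t rc;

    /* Note the argument order: tocode first, then fromcode. */
    cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;

    return out_size - out_left;
}
```

Each code point occupies four bytes in the output, so a buffer of four times
the input length in bytes is always large enough for this direction.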

    Additionally, since POSIX defines wchar_t as an opaque data type, we hope
    that people are using the std C interfaces to do conversions between wchar_t
    and multibyte characters if possible.

    With regards,

    Ienup

    ] From: Rick Cameron <Rick.Cameron@businessobjects.com>
    ] Subject: RE: What's in a wchar_t string on unix?
    ] Date: Mon, 1 Mar 2004 13:59:06 -0800
    ]
    ] OK, I guess I need to be more precise in my question.
    ]
    ] For each of the popular unices (Solaris, HP-UX, AIX, and - if possible -
    ] linux), can anyone answer the following question:
    ]
    ] Assuming that the locale is set to Unicode, what is in a wchar_t string? Is
    ] it UTF-32 or pseudo-UTF-16 (i.e. UTF-16 code units, zero-extended to 32
    ] bits)?
    ]
    ] I'm not expecting that there's a single answer for all the unices of
    ] interest. And I'm well aware that our application can store in a wchar_t []
    ] whatever it wants. I'm trying to find out what the O/S expects to be in a
    ] wchar_t string.
    ]
    ] The reason we want to know this is that we want to be able to write a
    ] function that converts from UTF-8 (stored in a char []) to wchar_t []
    ] properly. Obviously the function may need to behave differently on
    ] different flavours of unix.
    ]
    ] I am aware of the utility functions offered by TUC to perform conversions
    ] between UTF-8, UTF-16 and UTF-32. These functions do not handle the case of
    ] pseudo-UTF-16; which doesn't surprise me, since AFAIK it's not a conformant
    ] encoding form. Nonetheless, I have a strong suspicion that some unices may
    ] use it.
    ]
    ] Cheers
    ]
    ] - rick cameron


