Re: What's in a wchar_t string on unix?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 01 2004 - 16:10:38 EST


    What you'll put or find in a wchar_t is application dependent. The only
    guarantee is that each character encoded in the source, and compiled with the
    appropriate source charset, ends up as a single code unit (not necessarily a
    codepoint). That charset is not necessarily Unicode.
    At run-time, functions in the standard libraries that work with or return wide
    strings only expect these strings to be encoded according to the current locale
    (not necessarily Unicode).
    So if you run your program in an environment where the locale is ISO-8859-2,
    you'll find code units whose values, between 0 and 255, match their positions
    in the ISO-8859-2 standard, but you won't find the corresponding character
    codepoints as defined by Unicode.
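
    A minimal sketch of how one might check this on a given system. The helper
    name first_wide_value is hypothetical, and the locale name passed to it is an
    assumption that has to match a locale actually installed on the machine. The
    value returned for non-ASCII bytes is implementation-defined: some C libraries
    store ISO 10646 values in wchar_t whatever the locale, and advertise that by
    defining __STDC_ISO_10646__.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns the wchar_t value of the first character of
   the multibyte string mb, decoded under the named locale. Returns -1 if the
   locale is unavailable, the string is empty, or the bytes are invalid. */
long first_wide_value(const char *locale_name, const char *mb)
{
    wchar_t wc;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                     /* locale not installed */

    mbtowc(NULL, NULL, 0);             /* reset any shift state */
    if (mbtowc(&wc, mb, MB_CUR_MAX) <= 0)
        return -1;                     /* empty string or invalid sequence */

    return (long)wc;
}
```

    With an ISO-8859-2 locale and the single byte 0xB1, the result may be 0xB1
    (the position in ISO-8859-2) or 0x105 (the Unicode codepoint of the same
    letter), depending on the C library.
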
    A wchar_t can then be used with any charset whose minimum code unit size is
    lower than or equal to the size of the wchar_t type. This may be a Unicode
    encoding form, or any other encoding (except UTF-32 if wchar_t is defined as a
    16-bit integer type, which is not enough to represent every single Unicode
    codepoint).
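
    The size constraint can be tested rather than assumed. The sketch below uses
    WCHAR_MAX from the standard headers; the function names fits_in_wchar and
    wchar_is_iso10646 are made up for illustration.

```c
#include <stdint.h>
#include <wchar.h>

/* Returns 1 if one wchar_t is wide enough to hold the Unicode codepoint cp
   as a single value, 0 otherwise. This says nothing about which charset the
   implementation actually stores in wchar_t. */
int fits_in_wchar(unsigned long cp)
{
    return cp <= 0x10FFFFUL && cp <= (unsigned long)WCHAR_MAX;
}

/* Returns 1 if the implementation promises that wchar_t values are
   ISO 10646 (Unicode) values, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

    Where wchar_t is a 16-bit type, fits_in_wchar(0x1F600) returns 0, which is
    the UTF-32 exception mentioned above.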

    wchar_t is thus merely a convenience for Unicode, since it is generally larger
    than char; its presence does not mean it will support UTF-16 or UTF-32 (in
    ANSI C, wchar_t is allowed to represent the same type as char). So you'll
    still be platform dependent if you want to store a single character in a
    wchar_t variable. However a "wide" string constant (of type wchar_t*) should
    be able to store and represent any Unicode character or codepoint, possibly by
    mapping one codepoint to several wchar_t code units...
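
    As an illustration, here is a small sketch that counts how many wchar_t code
    units the compiler emits for one codepoint outside the BMP. It assumes a C99
    compiler that accepts universal character names in wide string literals; the
    function name clef_units is hypothetical.

```c
#include <wchar.h>

/* Number of wchar_t code units the compiler used to store U+1D11E
   (MUSICAL SYMBOL G CLEF) in a wide string constant. */
size_t clef_units(void)
{
    static const wchar_t clef[] = L"\U0001D11E";
    return wcslen(clef);
}
```

    Compilers with a 32-bit wchar_t typically store this as one unit, while
    those with a 16-bit wchar_t typically store a surrogate pair, so the result
    differs between platforms.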

    Unlike Java's "char" type which is always an unsigned 16-bit integer on all
    platforms, there's no standard size for wchar_t in C and C++...

    ----- Original Message -----
    From: Rick Cameron
    To: unicode@unicode.org
    Sent: Monday, March 01, 2004 8:13 PM
    Subject: What's in a wchar_t string on unix?

    Hi, all
    This may be an FAQ, but I couldn't find the answer on unicode.org.
    It seems that most flavours of unix define wchar_t to be 4 bytes. If the locale
    is set to be Unicode, what's in a wchar_t string? Is it UTF-32, or UTF-16 with
    the code units zero-extended to 4 bytes?
    Cheers
    - rick cameron


