Re: What's in a wchar_t string on unix?

From: Antoine Leca (Antoine10646@Leca-Marti.org)
Date: Tue Mar 02 2004 - 05:50:55 EST


    Rick Cameron asked:
    > It seems that most flavours of unix define wchar_t to be 4 bytes.

    As your "most" suggests, this is not universal. What if it is 8 bytes? ;-)

    > If the locale is set to be Unicode,

    That part is highly suspect.
    Since you write that, you already know that the wchar_t encoding (as well
    as the char one) depends on the locale setting. Few people get this right.
    So you then also know that "wchar_t is implementation-defined" in all the
    relevant standards (ANSI, C99, POSIX, SUS). In other words, the answer is
    in the documentation for YOUR implementation.

    Now, we can try to guess. But these are only guesses.

    > what's in a wchar_t string? Is it UTF-32, or UTF-16 with the code units
    zero-extended to 4 bytes?

    The latter is a heresy. Nobody should be foolish enough to do this. UCS-2
    with the code units zero-extended to 4 bytes might be an option, but if an
    implementor has support for UTF-16, why would she store extended UTF-16 (in
    whatever form, i.e. split or joined, 4 or 8 bytes) in wchar_t? Any evidence
    of this would be a severe bug, IMHO.

    Back to your original question, and assuming "the locale is set to be
    Unicode", you are as likely to encounter UTF-32 values (which would mean
    the implementation does have Unicode 3.1 support) as zero-extended UCS-2
    (the case of a pre-3.1 Unicode implementation). Other values would be very
    strange, IMHO.

    Recent standards have a feature test macro, __STDC_ISO_10646__, which, if
    defined, will tell you the answer: defined to be greater than 1999xxL will
    mean UTF-32 values. Defined but less than 1999xxL will probably mean no
    surrogate support, hence zero-extended UCS-2. Undefined does not tell you
    anything.
    Unfortunately, undefined is also the most common setup.
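
    To make this concrete, here is a minimal sketch of such a compile-time
    check. The macro itself is standard; the exact cut-off value 199901L used
    below is my assumption for a surrogate-aware revision of ISO 10646 (it
    stands in for the 1999xxL above):

      /* Sketch of the feature test described above; 199901L is my
         assumed cut-off for a surrogate-aware ISO 10646 revision. */
      #include <stdio.h>

      int main(void)
      {
      #if defined(__STDC_ISO_10646__)
      # if __STDC_ISO_10646__ >= 199901L
          printf("wchar_t holds UTF-32 (ISO 10646 as of %ldL)\n",
                 (long)__STDC_ISO_10646__);
      # else
          printf("wchar_t holds ISO 10646 values, but probably without "
                 "surrogate support (zero-extended UCS-2)\n");
      # endif
      #else
          printf("__STDC_ISO_10646__ undefined: no information\n");
      #endif
          return 0;
      }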

    Frank Yung-Fong Tang answered:
    > The more interesting question is, why do you need to know the
    > answer of your question. And the ANSI/C wchar_t model basically
    > suggest, if you ask that question, you are moving to a wrong direction....

    I am not so sure. I agree that the wchar_t model is basically a dead end
    nowadays. But until the new model (char16_t, char32_t) gets formalized and
    implemented, it is better than nothing, since implementers did try to get
    it right. Depending on the degree of conformance you require, and also on
    the allowance you give to bringing in something heavy (this could rule out
    ICU, for instance), the minimalistic wchar_t support might help.

    Philippe Verdy wrote:
    > What you'll put or find in wchar_t is application dependant.

    Disagree. The result of mbtowc is NOT application-dependent. It is rather
    implementation-dependent, which might be rather more disturbing...
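
    To illustrate (a sketch only; the locale name "en_US.UTF-8" is my
    assumption, spellings vary per system): whatever value mbtowc() stores
    is chosen by the C library, not by the application.

      #include <locale.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
          setlocale(LC_CTYPE, "en_US.UTF-8");  /* assumed locale name */
          wchar_t wc;
          const char *mb = "\xC3\xA9";         /* U+00E9, e acute, in UTF-8 */
          if (mbtowc(&wc, mb, 2) > 0)
              /* On a 4-byte-wchar_t Unix this typically prints 0xE9,
                 but no standard guarantees that value. */
              printf("wchar_t value: 0x%lX\n", (unsigned long)wc);
          return 0;
      }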

    > But there's only a guarantee to find a single
    > code unit (not necessarily a codepoint) for characters encoded in the
    > source and compiled with the appropriate source charset.

    Can't parse that.

    > But this charset is not necessarily Unicode.

    This you know at the moment you are compiling (which is not the same as
    the result of using the library functions, by the way).

    > At run-time, functions in the standard libraries that work with or
    > return wide strings only expect these strings to be encoded
    > according to the current locale (not necessarily Unicode).
    > So if you run your program in an environment where the locale is
    > ISO-8859-2,

    ... you are answering something completely opposed to what he asked, since
    it specified:

    : > If the locale is set to be Unicode,

    > you'll find code units whose value between 0 and 255 match their
    > position in the ISO-8859-2 standard,

    That is wrong. When "your locale is ISO-8859-2" (whatever that may really
    mean), you know next to nothing about the encoding used for wchar_t. It
    might be ISO-8859-2 (the degenerate case where wchar_t == char), it might
    be Unicode (the best probability on Unix if wchar_t is 4 bytes), or it
    might even be something very different like a flat EUC-XX (on some
    East-Asian flavour of Unix). The only thing you know for sure is that it
    is not EBCDIC!

    > A wchar_t can then be used with any charset whose minimum code unit size
    > is lower than or equal to the size of the wchar_t type.

    Wrong again. "Any" is too strong. There are many charsets that, while
    being "smaller" than some other, cannot be shoe-horned into the encoding
    of the wider form. For example, if wchar_t is 2 bytes and holds values
    according to EUC-JP, you cannot encode Big-5 or ISCII with it, even if
    their minimum code unit size is equal or even less: this is because all
    the needed codepoints are not defined in EUC-JP.

    Unicode, among its properties, does have the one of encompassing all
    existing charsets, so it aims at satisfying the property you spelled out.
    But the mere fact that this is an objective of Unicode should show that
    the other existing charsets do not satisfy it.

    > wchar_t is then only convenient for Unicode,

    I cannot see what you are inferring this from.

    > However a "wide" string constant (of type wchar_t*) should be able
    > to store and represent any Unicode character or codepoint,
    > possibly by mapping one codepoint to several wchar_t code units...

    This is specifically prohibited.
    The very point of wchar_t was to avoid the multibyte stuff. So if you
    support Unicode 3.1 (surrogates), you are required to have a 21-bit or
    wider wchar_t. A 16-bit wchar_t limits you ipso facto to 3.0 support.
    I confirmed this various times with the C committee, because I wanted, if
    at all possible, to qualify existing 16-bit wchar_t implementations so as
    to make them able to use the __STDC_ISO_10646__ feature (to indicate e.g.
    Philippine script support). The committee made it very clear that this is
    not possible.

    > Unlike Java's "char" type which is always an unsigned 16-bit integer
    > on all platforms, there's no standard size for wchar_t in C and C++...

    After all, this is correct.
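
    (A trivial sketch, if you want to see what your own platform does;
    WCHAR_MAX is the standard C99 macro from <wchar.h>:)

      #include <stdio.h>
      #include <wchar.h>

      int main(void)
      {
          /* Unlike Java's char, both values vary from platform to platform. */
          printf("sizeof(wchar_t) = %zu bytes, WCHAR_MAX = %lu\n",
                 sizeof(wchar_t), (unsigned long)WCHAR_MAX);
          return 0;
      }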

    Rick Cameron then wrote:
    > OK, I guess I need to be more precise in my question.
    > For each of the popular unices (Solaris, HP-UX, AIX, and - if
    > possible - linux), can anyone answer the following question:
    >
    > Assuming that the locale is set to Unicode

    What do you mean by "locale is set to Unicode":
      setlocale(LC_ALL, "Unicode");
        result is garbage, for all I know

      setlocale(LC_CTYPE, "qq_XX.utf8");

      something else?

    Does it include the specially enabled Japanese and Chinese versions (which
    may use some EUC encoding for wchar_t, in order to ease compatibility)?

    And of course, this highly depends on the release number (of libc, mainly).
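
    A POSIX way to at least find out the char-side encoding of the current
    locale is nl_langinfo(CODESET). A sketch; note that it still tells you
    nothing definitive about the wchar_t side:

      #include <langinfo.h>
      #include <locale.h>
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          setlocale(LC_CTYPE, "");   /* pick up the environment's locale */
          const char *cs = nl_langinfo(CODESET);
          printf("codeset: %s\n", cs);
          if (strcmp(cs, "UTF-8") == 0)
              puts("char strings are UTF-8; wchar_t is *probably* UTF-32 here");
          return 0;
      }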

    >, what is in a wchar_t string? Is it UTF-32 or pseudo-UTF-16
    > (i.e. UTF-16 code units, zero-extended to 32 bits)?

    See above about pseudo-UTF-16.

    > I'm trying to find out what the O/S expects to be in a
    > wchar_t string.

    By the way, a Unix OS does not expect anything in a wchar_t[]. It does not
    care about them at any single point I can think of.
    There are libc functions that do process them: the mb*towc*/wc*tomb*
    series, the wcs* series, the w*/f*ws versions of the <stdio.h> functions,
    and some features of the classic printf and scanf. But none of this (as
    opposed to Windows NT) is passed down to the OS, at least not without a
    possibility to inspect the result.
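
    For example (a sketch; the string and locale choice are mine), a
    whole-string conversion stays entirely inside libc:

      #include <locale.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <wchar.h>

      int main(void)
      {
          setlocale(LC_CTYPE, "");           /* environment's locale */
          const char *mb = "caf\xC3\xA9";    /* "cafe" with e acute, in UTF-8 */
          wchar_t ws[16];
          size_t n = mbstowcs(ws, mb, 16);   /* libc does the decoding... */
          if (n != (size_t)-1)
              printf("%zu wide characters\n", n);  /* ...the kernel never sees ws */
          return 0;
      }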

    > The reason we want to know this is that we want to be able to write a
    > function that converts from UTF-8 (stored in a char []) to wchar_t []
    > properly. Obviously the function may need to behave differently on
    > different flavours of unix.

    OK, thanks for explaining your problem.
    Basically, if wchar_t encodes UTF-32, you are free of any problem. Clearly,
    this is (or should be) what all current releases do. So your problem is how
    to handle those old versions (which ones?) that do not know about
    surrogates, and will expect a surrogate pair to be stored in two wchar_t
    cells. And then will handle this "correctly", as much as that may mean
    anything.

    Did I reformulate your question correctly?

    A way to see this is: what happens to some old Unix (or anything else) when
    fed plane 1 characters? I would say (assuming it is not outright broken),
    well, nothing special: before Unicode 3.1, the standard was the ISO 10646
    31-bit form, which says every value up to 0x7FFFFFFF may be used, and even
    says that the greater values may indeed be used (for private use: this is,
    by the way, the biggest incompatibility introduced by the limitation to
    U+10FFFD). So your >0xFFFF values should be handled correctly by the OS,
    which will not do anything special. In particular, of course, it will not
    print them, since it does not have any clue about such characters, whatever
    the encoding used!

    So I think the bottom line is: who cares about the encoding for the upper
    planes? (Provided it *is* Unicode for the lower groups: as you can see,
    this is difficult to say for sure.)

    On the other hand, encoding surrogate characters as two wchar_t is very
    likely to bring you a lot of problems, for no real benefit I can envision.
    Furthermore, it only matters for old platforms that are fading away, so it
    adds maintenance difficulties.

    Go ahead, encode as UTF-32, whatever libc really expects. Ultimately, the
    only one who will use the data is you, anyway!
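
    For what it is worth, such a decoder is short. Here is a crude sketch
    (the helper name is mine; invalid sequences are mapped to U+FFFD and
    overlong forms are not rejected), assuming a 4-byte wchar_t:

      /* Hypothetical helper: decode a NUL-terminated UTF-8 string into
         UTF-32, one codepoint per wchar_t (assumes 4-byte wchar_t and
         dstlen >= 1).  Returns the number of wide characters written. */
      #include <stddef.h>
      #include <wchar.h>

      size_t utf8_to_utf32(const unsigned char *src, wchar_t *dst,
                           size_t dstlen)
      {
          size_t n = 0;
          while (*src && n + 1 < dstlen) {
              unsigned long cp;
              int extra;
              unsigned char c = *src++;
              if      (c < 0x80)           { cp = c;        extra = 0; }
              else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
              else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
              else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
              else                         { cp = 0xFFFD;   extra = 0; }
              while (extra-- > 0) {
                  if ((*src & 0xC0) != 0x80) { cp = 0xFFFD; break; }
                  cp = (cp << 6) | (*src++ & 0x3F);
              }
              dst[n++] = (wchar_t)cp;   /* one codepoint per cell: no pairs */
          }
          dst[n] = L'\0';
          return n;
      }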

    Hope it helps,

    Antoine

    PS: if you write the UTF-8 to UTF-32 decoder, you should also write the
    reverse encoder: letting the OS do the conversion back to UTF-8 won't give
    you useful results.
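
    Here is the matching sketch for that reverse direction, under the same
    assumptions (hypothetical helper name, 4-byte wchar_t holding UTF-32,
    dstlen >= 1):

      /* Hypothetical helper: encode UTF-32 (one codepoint per wchar_t)
         back to UTF-8.  Stops when out of space, or on a value above
         U+10FFFF.  Returns the number of bytes written. */
      #include <stddef.h>
      #include <wchar.h>

      size_t utf32_to_utf8(const wchar_t *src, unsigned char *dst,
                           size_t dstlen)
      {
          size_t n = 0;
          for (; *src; ++src) {
              unsigned long cp = (unsigned long)*src;
              if (cp < 0x80 && n + 2 <= dstlen) {
                  dst[n++] = (unsigned char)cp;
              } else if (cp < 0x800 && n + 3 <= dstlen) {
                  dst[n++] = (unsigned char)(0xC0 | (cp >> 6));
                  dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
              } else if (cp < 0x10000 && n + 4 <= dstlen) {
                  dst[n++] = (unsigned char)(0xE0 | (cp >> 12));
                  dst[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                  dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
              } else if (cp <= 0x10FFFF && n + 5 <= dstlen) {
                  dst[n++] = (unsigned char)(0xF0 | (cp >> 18));
                  dst[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                  dst[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                  dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
              } else
                  break;              /* out of space, or out of range */
          }
          dst[n] = '\0';
          return n;
      }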


