RE: What's in a wchar_t string on unix?

From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Mon Mar 01 2004 - 16:59:06 EST


    OK, I guess I need to be more precise in my question.
     
    For each of the popular unices (Solaris, HP-UX, AIX, and - if possible -
    linux), can anyone answer the following question:
     
    Assuming that the locale is set to Unicode, what is in a wchar_t string? Is
    it UTF-32 or pseudo-UTF-16 (i.e. UTF-16 code units, zero-extended to 32
    bits)?
     
    I'm not expecting that there's a single answer for all the unices of interest.
    And I'm well aware that our application can store in a wchar_t [] whatever
    it wants. I'm trying to find out what the O/S expects to be in a wchar_t
    string.
     
    The reason we want to know this is that we want to be able to write a
    function that converts from UTF-8 (stored in a char []) to wchar_t []
    properly. Obviously the function may need to behave differently on different
    flavours of unix.
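
A minimal sketch of such a conversion function, under stated assumptions. The name `utf8_to_wcs` and the `as_utf16` flag are hypothetical and not taken from any platform API. With `as_utf16` set to 0 it stores one code point per wchar_t (the UTF-32 case); with a nonzero value it stores code points above U+FFFF as a surrogate pair, one code unit per wchar_t (the "pseudo-UTF-16" case). Which mode a given unix expects is exactly the open question, so the caller has to choose.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 from `src` into `dst`.
 * as_utf16 == 0: one code point per wchar_t (assumes wchar_t is wide
 *                enough to hold values up to 0x10FFFF).
 * as_utf16 != 0: code points above U+FFFF are written as two surrogates.
 * Returns the number of wchar_t written, excluding the terminator, or
 * (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_wcs(wchar_t *dst, size_t dst_len, const char *src, int as_utf16)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const unsigned long min_cp[] = { 0x0UL, 0x80UL, 0x800UL, 0x10000UL };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    if (dst_len == 0)
        return (size_t)-1;

    while (*s) {
        unsigned long cp;
        int extra, i;

        /* The lead byte gives the sequence length and the top bits. */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;
        s++;

        /* Each continuation byte must look like 10xxxxxx. */
        for (i = 0; i < extra; i++) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (unsigned long)(*s++ & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFFUL ||
            (cp >= 0xD800UL && cp <= 0xDFFFUL))
            return (size_t)-1;

        if (as_utf16 && cp > 0xFFFFUL) {
            if (n + 2 >= dst_len)
                return (size_t)-1;
            cp -= 0x10000UL;
            dst[n++] = (wchar_t)(0xD800UL + (cp >> 10));
            dst[n++] = (wchar_t)(0xDC00UL + (cp & 0x3FFUL));
        } else {
            if (n + 1 >= dst_len)
                return (size_t)-1;
            dst[n++] = (wchar_t)cp;
        }
    }
    dst[n] = L'\0';
    return n;
}
```

The sketch deliberately leaves the choice of mode to the caller rather than guessing it per platform, since nothing in this thread establishes what each system expects.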
     
    I am aware of the utility functions offered by TUC to perform conversions
    between UTF-8, UTF-16 and UTF-32. These functions do not handle the case of
    pseudo-UTF-16, which doesn't surprise me, since AFAIK it's not a conformant
    encoding form. Nonetheless, I have a strong suspicion that some unices may
    use it.
     
    Cheers
     
    - rick cameron

      _____

    From: Frank Yung-Fong Tang [mailto:ytang0648@aol.com]
    Sent: March 1, 2004 12:48
    To: Rick Cameron
    Cc: unicode@unicode.org
    Subject: Re: What's in a wchar_t string on unix?


    Rick Cameron wrote on 3/1/2004, 2:13 PM:

    Hi, all

    This may be an FAQ, but I couldn't find the answer on unicode.org.

    The reason is that there is "NO answer" to the question you ask.

    It seems that most flavours of unix define wchar_t to be 4 bytes.

    Depends on which UNIX and which version, and on how you define "most
    flavours".

    If the locale is set to be Unicode, what's in a wchar_t string?

    No answer for that, because
    1) The ANSI C standard does not define it (neither its size nor its content).
    2) Several organizations have tried to establish standards for Unix. One of
    them is The Open Group's "Base Specifications", IEEE Std 1003.1, 2003. But
    that does not define what wchar_t should hold either.
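
Since the standards leave this open, about the only thing a program can check portably is what the implementation itself reports. The sketch below is an illustration, not a complete answer. The function names are hypothetical. `__STDC_ISO_10646__` is a C99 macro (so not available in plain ANSI C89 compilers) that an implementation may define to indicate that wchar_t values are ISO/IEC 10646 code points. If it is not defined, the program learns nothing either way.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits on this implementation. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: 1 if the implementation states (via the C99 macro
 * __STDC_ISO_10646__) that wchar_t values are ISO/IEC 10646 code points,
 * 0 if it makes no such statement. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A result of 0 from `wchar_is_iso10646` does not mean wchar_t holds something else; it only means the implementation makes no promise.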

    Is it UTF-32, or UTF-16 with the code units zero-extended to 4 bytes?

    Cheers

    - rick cameron

    The more interesting question is why you need to know the answer to your
    question. The ANSI C wchar_t model basically suggests that if you are asking
    that question, you are moving in the wrong direction....
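
One way to read the point above is that wchar_t is meant to be treated as opaque, with the C library doing the conversion from the locale's multibyte encoding. A minimal sketch under that reading follows. The wrapper name `locale_to_wcs` is hypothetical, while `mbstowcs` is the standard C function. It assumes the caller has already selected a suitable locale, for example with `setlocale(LC_CTYPE, "")` in an environment configured for UTF-8. Locale names and availability vary by system.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper around the standard mbstowcs(): convert `src`,
 * encoded in the current locale's multibyte encoding, into `dst`.
 * The result is always NUL-terminated and is silently truncated to
 * dst_len - 1 wide characters. Returns the number of wide characters
 * stored, or (size_t)-1 if `src` is invalid in the current locale
 * or dst_len is 0. */
size_t locale_to_wcs(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;

    /* Leave room for the terminator: mbstowcs does not write one when
     * it fills the whole buffer. */
    n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        return n;

    dst[n] = L'\0';
    return n;
}
```

With this approach the program never needs to know whether the resulting wchar_t values are UTF-32 or something else, as long as it only passes them back to the same C library.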


