wchar_t (was RE: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 07:38:51 CST

    Arcane Jill wrote:
    > and Microsoft Visual C++, which fixes wchar_t to SIXTEEN bits.
    >
    > The existence of wchar_t does not imply UTF-32. It does /not/ imply
    > UTF-16. It does not even imply Unicode. It's just a type.

    Very well put (except for the typo which I have taken the liberty of
    correcting in the quote).
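
    To make the "it's just a type" point concrete, here is a minimal sketch
    that reports how wide wchar_t is on whatever platform compiles it. The
    helper names are made up for this illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
// Commonly 16 with Microsoft Visual C++ and 32 with GCC on Linux,
// but the language standard fixes neither value.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: code that needs one wchar_t per code point can
// state that assumption instead of relying on it silently.
constexpr bool wchar_holds_any_code_point() {
    return wchar_bits() >= 21;  // U+10FFFF needs 21 bits
}
```

    Portable code would branch on (or static_assert) the second function
    rather than assume a particular encoding behind wchar_t.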

    And as wchar_t was mentioned, I realized it has a lot to do with the text vs
    binary data distinction. But let me start at the beginning.

    What is wchar_t? Yes, it is a Unicode related type. It does not imply
    Unicode. But nor does its absence imply no Unicode. What is it then? It is
    the type that is used in the implementation (or rather interface) of the
    basic Unicode functions in a compiler (though possibly related to system
    API).

    By declaring a single (and rather loose) type, two things have happened:
    * Due to different implementations the source code became less portable.
    * A notion was created that only a single implementation of the basic
    functions is needed.

    A single implementation approach is used often, because it modularizes
    things and because it is often natural and efficient. But not always. A
    counterexample would be a graphics library that defined only a PutPixel
    function, claiming that it suffices. Mathematically it does, but it is far
    from efficient (in terms of performance) and far from user (programmer)
    friendly.

    Back to wchar_t. Let's introduce wchar32_t. Most Unicode functions can be
    implemented using that type. But it may also be useful to define some of
    those functions for UTF-8 strings. Do we need a new type for that? In C, one
    would get away with the char type, but for C++ it would be useful to
    introduce the wchar8_t type. Now notice that while some functions for the
    wchar32_t type can take a single character, the same function for the
    wchar8_t type must (well, should) operate on strings:
    BOOL isspace(wchar32_t), but BOOL isspace(wchar8_t *).
    (I am deliberately abandoning the int and wint_t types typically used in
    such functions.)
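
    A minimal sketch of that pair of overloads follows. The type aliases are
    the hypothetical names used above, the functions are called is_space
    only to avoid clashing with the standard isspace, and the white space
    table is deliberately incomplete. The decoder is simplified as well: it
    does not reject overlong forms or surrogates.

```cpp
#include <cassert>

// Hypothetical types from the discussion; not standard names.
using wchar32_t = char32_t;
using wchar8_t  = unsigned char;

// Character version: one code point in, one answer out.
// Only a few white space code points are listed, for illustration.
bool is_space(wchar32_t c) {
    return c == 0x20 || (c >= 0x09 && c <= 0x0D) ||
           c == 0x85 || c == 0xA0 || c == 0x3000;
}

// Decode the first code point of a UTF-8 string. Returns U+FFFD for a
// malformed lead byte or a missing continuation byte.
wchar32_t first_code_point(const wchar8_t* s) {
    assert(s != nullptr);
    const wchar8_t b = s[0];
    int extra = 0;
    wchar32_t cp = b;
    if (b < 0x80)                { return cp; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else                         { return 0xFFFD; }
    for (int i = 1; i <= extra; ++i) {
        // A nul terminator is not a continuation byte, so a truncated
        // string stops here and nothing past the terminator is read.
        if ((s[i] & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}

// String version: looks at the first code point only.
bool is_space(const wchar8_t* s) {
    return is_space(first_code_point(s));
}
```

    Note that the string version here is itself just a wrapper around the
    character version, which is the simplest way to provide it.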

    The shift from character to string is very useful. For example:
    wchar32_t * strchr(wchar32_t *, wchar32_t), but
    wchar8_t * strchr(wchar8_t *, wchar8_t *).

    The wchar8_t * strchr(wchar8_t *, wchar8_t *) is close to wchar8_t *
    strstr(wchar8_t *, wchar8_t *), except that strchr should tolerate extra
    data in the second parameter: it would observe only the first codepoint
    and would not require the string to be nul terminated (and the same goes
    for the string version of isspace).
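
    As a rough sketch of that behavior, the function below searches a
    nul terminated UTF-8 haystack for the first code point of the needle and
    ignores whatever follows it. The names utf8_strchr and sequence_length
    are invented for this example, and the haystack is assumed to be
    well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

using wchar8_t = unsigned char;  // hypothetical type from the discussion

// Byte length of the UTF-8 sequence announced by a lead byte.
// Malformed lead bytes are treated as one-byte units.
std::size_t sequence_length(wchar8_t lead) {
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// strchr-like search. `needle` is a string, but only its first code point
// is observed, so it may be longer and need not be nul terminated.
// An empty needle finds nothing in this sketch.
const wchar8_t* utf8_strchr(const wchar8_t* haystack, const wchar8_t* needle) {
    assert(haystack != nullptr && needle != nullptr);
    // Count the bytes of the first code point without reading past a
    // terminator or a byte that is not a continuation byte.
    const std::size_t want = sequence_length(needle[0]);
    std::size_t n = 1;
    while (n < want && (needle[n] & 0xC0) == 0x80) ++n;

    const std::size_t hay_len =
        std::strlen(reinterpret_cast<const char*>(haystack));
    for (std::size_t i = 0; i + n <= hay_len; ++i) {
        if (std::memcmp(haystack + i, needle, n) == 0) return haystack + i;
    }
    return nullptr;
}
```

    Because UTF-8 is self-synchronizing, a bytewise match of a complete
    sequence cannot start in the middle of another character.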

    An implementation of wchar32_t * strchr(wchar32_t *, wchar32_t *) is also
    useful once you realize you also have the generic wchar_t type and want to
    write generic code with as little impact as possible. With the wint_t type,
    you constantly need to transform from strings to wint_t, and you'll keep
    doing it even when it is not necessary. Or you will add extra code to
    avoid it.

    Another reason is that some functions cannot be implemented with a single
    character as input, even with wchar32_t. Outputting for display is just
    one of them. I am not sure about collation, I'll leave that to the experts.

    Of course then you have the wchar16_t. Windows. Here BOOL isspace(wchar16_t)
    is actually int iswspace(wint_t), where wint_t is 16 bit. Windows is UCS-2.
    The way to extend it is to introduce BOOL isspace(wchar16_t *), possibly as
    int isspace16(wchar16_t *). Since on Windows wchar_t equals wchar16_t, you
    can use the isspace16 with pointers to native wchar_t strings.
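
    A sketch of what such an isspace16 could look like, assuming UTF-16 input
    and using char16_t as a stand-in for the hypothetical wchar16_t. The
    point is that a pointer lets the function see a surrogate pair, which a
    single 16 bit wint_t cannot carry. The white space table is incomplete
    and the helper names are made up.

```cpp
#include <cassert>

// Hypothetical names from the discussion, mapped onto standard C++11 types.
using wchar16_t = char16_t;
using wchar32_t = char32_t;

// Character version; only a few white space code points, for illustration.
bool isspace32(wchar32_t c) {
    return c == 0x20 || (c >= 0x09 && c <= 0x0D) ||
           c == 0x85 || c == 0xA0 || c == 0x3000;
}

// Decode the first code point of a nul terminated UTF-16 string.
// An unpaired surrogate is passed through unchanged.
wchar32_t first_code_point16(const wchar16_t* s) {
    assert(s != nullptr);
    const wchar32_t hi = s[0];
    if (hi >= 0xD800 && hi <= 0xDBFF) {
        // s[1] is at worst the terminator, so this read stays in bounds.
        const wchar32_t lo = s[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;
}

// String version: observes the first code point, surrogate pair or not.
bool isspace16(const wchar16_t* s) {
    return isspace32(first_code_point16(s));
}
```

    With this alias, passing a native Windows wchar_t string would need a
    cast; defining wchar16_t as wchar_t on that platform would avoid it.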

    I've been assuming overloading (so C++) in the BOOL functions. Consider it
    pseudo code. For non-overloading (i.e. C) examples, naming convention issues
    arise. Especially where wint_t functions would get the wchar_t counterparts.
    I'd even stick with the latter and not implement the character based
    functions at all, thus reducing the number of functions.

    What remains is the definition of the wcharNN_t types. The names suggest
    an actual size, but NN should probably be just the minimum allowed size,
    although typically the two will be equal.
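
    One possible way to write such definitions down, purely as a sketch: the
    "at least NN bits" reading maps directly onto the uint_leastNN_t types of
    <cstdint>. The wcharNN_t names remain hypothetical. (C++11 later defined
    char16_t and char32_t with the same size as uint_least16_t and
    uint_least32_t, which is the same idea.)

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical definitions: NN is a minimum width, not an exact one.
using wchar8_t  = unsigned char;        // at least 8 bits by definition
using wchar16_t = std::uint_least16_t;  // smallest type with >= 16 bits
using wchar32_t = std::uint_least32_t;  // smallest type with >= 32 bits

// Width of a type in bits on the current platform.
template <typename T>
constexpr std::size_t bits_of() {
    return sizeof(T) * CHAR_BIT;
}

static_assert(bits_of<wchar8_t>()  >= 8,  "wchar8_t too narrow");
static_assert(bits_of<wchar16_t>() >= 16, "wchar16_t too narrow");
static_assert(bits_of<wchar32_t>() >= 32, "wchar32_t too narrow");
```

    On mainstream platforms the three widths come out as exactly 8, 16 and
    32, but code written against these aliases does not depend on that.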

    Would every run time library be required to implement functions for all
    three types? Probably not. One would suffice, with wchar_t being equal to
    its native implementation type. Adding other types would increase the
    portability of the platform. After all, writing wrappers would be fairly
    simple, I suppose.

    And, finally, to get back to the text vs binary distinction. On UNIX,
    (wchar8_t *) would equal (char *). Meaning no distinction, no conversion. At
    least by default.

    On Windows, conversion from (char_t *) to (wchar8_t *) would imply ACP based
    conversion. In C++ this could be an overloaded type conversion. But you
    could disable that (or simply cast) and get the UNIX behavior, should you
    need that.

    What is interesting is that you can do that for the (char_t *) / (wchar8_t
    *) pair. You however MUST convert between (char_t *) and the other two types
    (16 and 32). And in this case, you will lose invalid sequences. This makes
    the (wchar8_t *) based processing the most robust and the only useful
    alternative where such behavior is needed.
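
    The loss can be shown with a simplified converter. In this sketch
    (to_utf32_lossy is an invented name) every byte that does not begin a
    complete sequence is replaced with U+FFFD, as converters commonly do.
    Overlong forms and surrogates are not rejected, to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified UTF-8 to UTF-32 conversion. Any byte that cannot start a
// complete sequence becomes U+FFFD, so the original byte value is lost.
std::u32string to_utf32_lossy(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // 0 means "invalid lead byte"
        char32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(char32_t(0xFFFD));  // replacement character
            i += 1;                           // resynchronise on next byte
        }
    }
    return out;
}
```

    Two different byte strings, say "a", 0xFF, "b" and "a", 0xFE, "b", both
    come out as the same three code points, so no conversion back can tell
    them apart.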

    One problem is that even in (wchar8_t *) based processing, one might find
    out that the (wchar8_t *) functions are just wrappers to (wchar32_t *)
    functions. In such cases, in order to retain the full power of (wchar8_t *)
    processing, one would need to add extra code to alleviate that problem. Not
    impossible, but tedious, and error prone. But not always needed,
    fortunately.

    The other problem is that (wchar8_t *) based processing might not be
    possible, for example if a platform does not provide even the (wchar8_t *)
    wrappers. Which might be the case with Windows. Of course you can write the
    wrappers yourself, or perhaps find third party wrappers. But you could
    incur a cost if you need to constantly convert from UTF-8 (wchar8_t *)
    each time you want to call system APIs.

    Both problems can be solved with one simple change: introducing 128
    codepoints to allow the roundtrip of invalid sequences in UTF-8. Then the
    (wchar8_t *) wrappers get a defined behavior and need not be worked around.
    And the other two formats get the ability to retain invalid sequences,
    meaning you can also opt to convert everything to your native wchar_t,
    process, then convert back to (wchar8_t *). And treat it as (char *). Which
    some would call binary data. Funny name for a binary type, don't you think?
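
    A rough sketch of how such a roundtrip could work. No such 128
    codepoints are assigned, so the block used here (kEscapeBase, a private
    use range) is purely a placeholder for this example, and the function
    names are invented. Each byte that is not part of a well-formed sequence
    maps to one escape codepoint and back. An escape codepoint that arrives
    already encoded in the input is treated as invalid too, which is an
    assumption of this sketch and is what keeps the mapping reversible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Placeholder block of 128 escape code points, one per byte 0x80..0xFF.
constexpr char32_t kEscapeBase = 0xE000;

bool is_escape(char32_t cp) {
    return cp >= kEscapeBase && cp < kEscapeBase + 128;
}

// Decode one well-formed sequence at in[i]; returns its length, or 0.
std::size_t decode_one(const std::string& in, std::size_t i, char32_t& cp) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(in[k]); };
    const unsigned char b = byte(i);
    std::size_t len = 0;
    char32_t lowest = 0;  // smallest value allowed for this length
    if (b < 0x80)                { cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; lowest = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; lowest = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; lowest = 0x10000; }
    else return 0;
    if (i + len > in.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
    if (cp < lowest || cp > 0x10FFFF || surrogate || is_escape(cp)) return 0;
    return len;
}

// UTF-8 to UTF-32; each invalid byte becomes its escape code point.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = 0;
        const std::size_t len = decode_one(in, i, cp);
        if (len == 0) {  // an invalid byte is always >= 0x80 here
            const unsigned char b = static_cast<unsigned char>(in[i]);
            out.push_back(static_cast<char32_t>(kEscapeBase + (b - 0x80)));
            i += 1;
        } else {
            out.push_back(cp);
            i += len;
        }
    }
    return out;
}

// UTF-32 to UTF-8; escape code points turn back into the raw byte.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (is_escape(cp)) {
            out += static_cast<char>(0x80 + (cp - kEscapeBase));
        } else if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    With this pair, to_utf8(to_utf32(s)) gives back s byte for byte for any
    input, valid or not, which is exactly the (char *) behavior wanted.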

    Lars


