wchar_t (was RE: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 07:38:51 CST

Next message: Jon Hanna: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"

Previous message: Jon Hanna: "BOM in HTML (was Conformance (was UTF, BOM, etc))"
Next in thread: Jon Hanna: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"
Reply: Jon Hanna: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"
Maybe reply: Lars Kristan: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"
Maybe reply: Arcane Jill: "RE: wchar_t (was RE: 32'nd bit & UTF-8)"
Reply: Antoine Leca: "Re: wchar_t (was RE: 32'nd bit & UTF-8)"

Arcane Jill wrote:
> and Microsoft Visual C++, which fixes wchar_t to SIXTEEN bits.
>
> The existence of wchar_t does not imply UTF-32. It does /not/ imply
> UTF-16. It does not even imply Unicode. It's just a type.

Very well put (except for the typo which I have taken the liberty of
correcting in the quote).

And as wchar_t was mentioned, I realized it has a lot to do with the text vs
binary data distinction. But let me start at the beginning.

What is wchar_t? Yes, it is a Unicode-related type, but it does not imply
Unicode, nor does its absence imply that Unicode is unsupported. What is it,
then? It is the type used in the implementation (or rather the interface) of
a compiler's basic Unicode functions (though possibly tied to the system
API).
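Since the width differs between compilers, the quickest check is to ask the
compiler itself. A minimal sketch; the helper names wchar_bits and
wchar_holds_any_code_point are made up for this illustration and are not part
of any library:

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <cwchar>    // WCHAR_MAX

// Width of wchar_t, in bits, on whatever compiler builds this file.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if one wchar_t can hold every code point up to U+10FFFF.
// With a 16-bit wchar_t this is false.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

Code that needs one element per code point can test the second helper in a
static_assert instead of silently assuming a 32-bit wchar_t.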

By declaring a single (and rather loose) type, two things have happened:
* Due to different implementations the source code became less portable.
* A notion was created that only a single implementation of the basic
functions is needed.

The single-implementation approach is used often, because it modularizes
things and is frequently natural and efficient. But not always. A
counterexample would be a graphics library that defined only a PutPixel
function and claimed that it suffices. Mathematically it does, but it is far
from efficient (in terms of performance) and far from user (programmer)
friendly.

Back to wchar_t. Let's introduce wchar32_t. Most Unicode functions can be
implemented using that type. But it may also be useful to define some of
those functions for UTF-8 strings. Do we need a new type for that? In C, one
would get away with the char type, but for C++ it would be useful to
introduce the wchar8_t type. Now notice that while you can implement some
functions for wchar32_t type with characters, the same function for wchar8_t
type must (well, should) operate on strings:
BOOL isspace(wchar32_t), but BOOL isspace(wchar8_t *).
Here I am deliberately abandoning the int and wint_t types typically used
in such functions.
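The character/string pair could be sketched in C++ roughly as follows. The
assumptions are all mine: wchar32_t and wchar8_t are stand-in aliases, the
functions are named is_space and return bool so they do not collide with the
standard isspace, and the white space list is hard-coded for illustration
where a real library would use the Unicode property tables.

```cpp
#include <cassert>

// Hypothetical stand-ins for the types discussed in the message.
using wchar32_t = char32_t;       // one code point per element
using wchar8_t  = unsigned char;  // one UTF-8 byte per element

// Character form: code points with the White_Space property (hard-coded).
bool is_space(wchar32_t c) {
    return c == 0x20 || (c >= 0x09 && c <= 0x0D) || c == 0x85 || c == 0xA0 ||
           c == 0x1680 || (c >= 0x2000 && c <= 0x200A) || c == 0x2028 ||
           c == 0x2029 || c == 0x202F || c == 0x205F || c == 0x3000;
}

// Decodes the first code point of a nul-terminated UTF-8 string.
// Returns false if the first sequence is malformed or truncated.
bool decode_first(const wchar8_t* s, wchar32_t& out) {
    assert(s != nullptr);
    const wchar8_t b0 = s[0];
    int extra = 0;
    wchar32_t cp = 0, lowest = 0;
    if (b0 < 0x80) { out = b0; return true; }
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; lowest = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    for (int i = 1; i <= extra; ++i) {
        // A nul terminator fails this test, so we never read past it.
        if ((s[i] & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;  // non-shortest form, out of range, or surrogate
    out = cp;
    return true;
}

// String form: looks only at the first code point of the string.
bool is_space(const wchar8_t* s) {
    wchar32_t cp = 0;
    return decode_first(s, cp) && is_space(cp);
}
```

With these overloads, is_space(U' ') takes the character form, while a
pointer to the bytes C2 A0 (NO-BREAK SPACE) followed by any other text takes
the string form.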

The shift from character to string is very useful. For example:
wchar32_t * strchr(wchar32_t *, wchar32_t), but
wchar8_t * strchr(wchar8_t *, wchar8_t *).

The wchar8_t * strchr(wchar8_t *, wchar8_t *) is close to wchar8_t *
strstr(wchar8_t *, wchar8_t *), except that strchr should tolerate a second
parameter that holds more than one code point: it would observe only the
first code point and would not require that it be followed by a nul
terminator (and the same goes for the string version of isspace).
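A string-based strchr for UTF-8 could look something like the sketch below.
The names find_char and first_len and the wchar8_t alias are hypothetical,
and treating a malformed needle as a single byte is my choice, not something
the message specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

using wchar8_t = unsigned char;  // hypothetical stand-in: one UTF-8 byte

// Length in bytes of the first UTF-8 sequence in s, judged by its lead byte.
// Falls back to 1 if the expected continuation bytes are missing, so a
// malformed needle is searched for as a single byte.
std::size_t first_len(const wchar8_t* s) {
    std::size_t n = 1;
    if ((s[0] & 0xE0) == 0xC0)      n = 2;
    else if ((s[0] & 0xF0) == 0xE0) n = 3;
    else if ((s[0] & 0xF8) == 0xF0) n = 4;
    for (std::size_t i = 1; i < n; ++i)
        if ((s[i] & 0xC0) != 0x80) return 1;  // also stops at a nul
    return n;
}

// Finds the first code point of ch inside haystack. Anything after that
// first code point in ch is ignored, so ch need not be nul-terminated
// right after it.
const wchar8_t* find_char(const wchar8_t* haystack, const wchar8_t* ch) {
    assert(haystack != nullptr && ch != nullptr);
    const std::size_t n = first_len(ch);
    const char* needle = reinterpret_cast<const char*>(ch);
    for (const wchar8_t* p = haystack; ; ++p) {
        // strncmp stops at a nul, so it never reads past the haystack's end.
        if (std::strncmp(reinterpret_cast<const char*>(p), needle, n) == 0)
            return p;
        if (*p == 0) return nullptr;
    }
}
```

Because no lead byte can be mistaken for a continuation byte in UTF-8, a
plain byte comparison cannot match in the middle of another character as
long as the haystack is well-formed.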

An implementation of wchar32_t * strchr(wchar32_t *, wchar32_t *) is also
useful once you realize you also have the generic wchar_t type and want to
write generic code with as little impact as possible. With the wint_t type,
you constantly need to transform from strings to wint_t, and you'll keep
doing it even when it is not necessary, or you will add extra code to avoid
it.

Another reason is that some functions cannot be implemented using the
character input, even with wchar32_t. Outputting for display is just one of
them. I am not sure about collation, I'll leave that to the experts.

Then, of course, there is wchar16_t, that is, Windows. Here BOOL
isspace(wchar16_t) is actually int iswspace(wint_t), where wint_t is 16 bits
wide. Windows is UCS-2. The way to extend it is to introduce BOOL
isspace(wchar16_t *), possibly as int isspace16(wchar16_t *). Since on
Windows wchar_t equals wchar16_t, you can use isspace16 with pointers to
native wchar_t strings.
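What the pointer-based form buys on a 16-bit platform is the chance to see
both halves of a surrogate pair. A rough sketch, assuming char16_t as the
stand-in for wchar16_t; decode_first16, is_space32 and is_space16 are names
invented here, and the white space test is deliberately partial:

```cpp
#include <cassert>

using wchar16_t = char16_t;  // hypothetical stand-in: one UTF-16 code unit

// Decodes the first code point of a nul-terminated UTF-16 string.
// Returns false for an unpaired surrogate.
bool decode_first16(const wchar16_t* s, char32_t& out) {
    assert(s != nullptr);
    const char16_t u0 = s[0];
    if (u0 >= 0xD800 && u0 <= 0xDBFF) {        // high surrogate
        const char16_t u1 = s[1];              // safe: s[0] was not the nul
        if (u1 < 0xDC00 || u1 > 0xDFFF) return false;
        out = 0x10000 + ((static_cast<char32_t>(u0) - 0xD800) << 10)
                      + (static_cast<char32_t>(u1) - 0xDC00);
        return true;
    }
    if (u0 >= 0xDC00 && u0 <= 0xDFFF) return false;  // lone low surrogate
    out = u0;
    return true;
}

// Partial white space test, for illustration only.
bool is_space32(char32_t c) {
    return c == 0x20 || (c >= 0x09 && c <= 0x0D) || c == 0x3000;
}

// String form: classifies the first code point, surrogate pair or not.
bool is_space16(const wchar16_t* s) {
    char32_t cp = 0;
    return decode_first16(s, cp) && is_space32(cp);
}
```

A function that takes a single 16-bit unit can only ever be handed one half
of a pair such as D801 DC00 (U+10400), so it has nothing meaningful to
classify; the string form decodes the pair first.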

I've been assuming overloading (so C++) in the BOOL functions. Consider it
pseudocode. For non-overloading (i.e. C) examples, naming convention issues
arise, especially where wint_t functions would get the wchar_t counterparts.
I'd even stick with the latter and not implement the character-based
functions at all, thus reducing the number of functions.

What remains is the definition of the wcharNN_t types. The names suggest
the actual size, but NN is probably just the smallest allowed size.
Typically, though, the two will be equal.

Would every run time library be required to implement functions for all
three types? Probably not. One would suffice, with wchar_t being equal to
its native implementation type. Adding other types would increase the
portability of the platform. Then again, writing wrappers would be fairly
simple, I suppose.

And, finally, to get back to the text vs binary distinction. On UNIX,
(wchar8_t *) would equal (char *). Meaning no distinction, no conversion. At
least by default.

On Windows, conversion from (char_t *) to (wchar8_t *) would imply ACP based
conversion. In C++ this could be an overloaded type conversion. But you
could disable that (or simply cast) and get the UNIX behavior, should you
need that.

What is interesting is that you can do that for the (char_t *) / (wchar8_t
*) pair. You however MUST convert between (char_t *) and the other two types
(16 and 32). And in this case, you will lose invalid sequences. This makes
the (wchar8_t *) based processing the most robust and the only useful
alternative where such behavior is needed.

One problem is that even in (wchar8_t *) based processing, one might find
out that the (wchar8_t *) functions are just wrappers to (wchar32_t *)
functions. In such cases, in order to retain the full power of (wchar8_t *)
processing, one would need to add extra code to alleviate that problem. Not
impossible, but tedious, and error prone. But not always needed,
fortunately.

The other problem is that (wchar8_t *) based processing might not be
possible, for example if a platform does not provide even the (wchar8_t *)
wrappers. Which might be the case with Windows. Of course you can write the
wrappers yourself, perhaps find third party wrappers. But there could be an
incurred cost if you need to constantly convert from UTF-8 (wchar8_t *) each
time you want to call system APIs.

Both problems can be solved with one simple change. By introducing 128
codepoints to allow the roundtrip of invalid sequences in UTF-8. Then the
(wchar8_t *) wrappers get a defined behavior and need not be worked around.
And the other two formats get the ability to retain invalid sequences,
meaning you can also opt to convert everything to your native wchar_t,
process, then convert back to (wchar8_t *). And treat it as (char *). Which
some would call binary data. Funny name for a binary type, don't you think?
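To make the difference concrete, here is a rough sketch of a converter with
both behaviors: the usual lossy one (each invalid byte becomes U+FFFD) and
one that escapes each invalid byte to one of 128 reserved code points. The
proposal above does not say which code points those would be; as an
assumption for this sketch I borrow the range U+DC80..U+DCFF, which is what
Python's "surrogateescape" error handler uses. The function names are
invented as well.

```cpp
#include <cstddef>
#include <string>

// Assumption: invalid byte b (0x80..0xFF) is escaped as kEscapeBase + b,
// i.e. U+DC80..U+DCFF. These can never come out of well-formed UTF-8.
constexpr char32_t kEscapeBase = 0xDC00;

// Tries to decode one well-formed UTF-8 sequence starting at in[i].
static bool decode_at(const std::string& in, std::size_t i,
                      char32_t& cp, std::size_t& len) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    char32_t lowest = 0;
    if (b0 < 0x80) { cp = b0; len = 1; return true; }
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
    else return false;                       // not a valid lead byte
    if (i + len > in.size()) return false;   // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(in[i + k]);
        if ((b & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject non-shortest forms, out-of-range values and surrogates.
    return cp >= lowest && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// UTF-8 to UTF-32. Invalid bytes become escape code points if escape is
// true, otherwise U+FFFD (which cannot be converted back).
std::u32string to_utf32(const std::string& in, bool escape) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = 0;
        std::size_t len = 0;
        if (decode_at(in, i, cp, len)) {
            out.push_back(cp);
            i += len;
        } else {
            const unsigned char b = static_cast<unsigned char>(in[i]);
            out.push_back(escape ? static_cast<char32_t>(kEscapeBase + b)
                                 : static_cast<char32_t>(0xFFFD));
            ++i;
        }
    }
    return out;
}

// UTF-32 to UTF-8. Escape code points turn back into the original bytes.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp >= 0xDC80 && cp <= 0xDCFF) {
            out += static_cast<char>(cp - kEscapeBase);
        } else if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With escape set to true, to_utf8(to_utf32(s, true)) gives back s for any
byte string s, valid UTF-8 or not. With escape set to false, a stray 0xFF
byte comes back as the three bytes EF BF BD, and the original byte is gone.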

Lars

