From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 01 2004 - 16:10:38 EST
Subject: What's in a wchar_t string on unix?
What you'll put or find in wchar_t is application-dependent. The only guarantee
is that characters encoded in the source, and compiled with the appropriate
source charset, each occupy a single code unit (not necessarily a codepoint).
But this charset is not necessarily Unicode.
At run-time, functions in the standard libraries that work with or return wide
strings only expect these strings to be encoded according to the current locale
(not necessarily Unicode).
So if you run your program in an environment where the locale is ISO-8859-2,
you may find code units whose values, between 0 and 255, match their positions
in the ISO-8859-2 standard, rather than the corresponding character codepoints
as defined by Unicode.
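To check what a given platform really stores, a minimal sketch along these
lines can help. It assumes a hosted C99 library, and first_wide_unit is a
hypothetical helper name, not a standard function. It decodes the first
character of a multibyte string with mbrtowc, which follows the LC_CTYPE
category of the current locale.

```c
#include <string.h>  /* memset, strlen */
#include <wchar.h>   /* mbrtowc, mbstate_t, wchar_t */

/* Hypothetical helper: decode the first character of a multibyte string
   according to the current locale and return its wchar_t value.
   Returns -1 if the bytes are invalid or incomplete in that locale. */
long first_wide_unit(const char *mb)
{
    mbstate_t state;
    wchar_t wc = 0;
    size_t n;

    memset(&state, 0, sizeof state);            /* initial conversion state */
    n = mbrtowc(&wc, mb, strlen(mb) + 1, &state);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                              /* invalid or truncated input */
    return (long)wc;
}
```

Calling setlocale(LC_CTYPE, "") first and then passing a byte string typed in
that locale gives a value you can compare with the Unicode codepoint of the
same character. Without any setlocale call the program runs in the "C" locale,
where first_wide_unit("A") simply returns the value of L'A'.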
A wchar_t can then be used with any charset whose code unit size is no larger
than the size of the wchar_t type. This may be a Unicode encoding form, or any
other encoding (except UTF-32 if wchar_t is defined as a 16-bit integer type,
which is not enough to represent every single Unicode codepoint).
wchar_t is thus merely convenient for Unicode, since it is generally larger
than char, but its presence does not mean it will support UTF-16 or UTF-32 (in
ANSI C, wchar_t is allowed to be the same type as char). So you'll still be
platform-dependent if you want to store a single character in a wchar_t
variable. However, a "wide" string constant (of type wchar_t*) should be able
to store and represent any Unicode character or codepoint, possibly by mapping
one codepoint to several wchar_t code units...
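As an illustration of mapping one codepoint to several wchar_t code units,
here is a sketch under two assumptions that the C standard does not guarantee:
that wchar_t is at least 16 bits wide, and that a platform with a 16-bit
wchar_t uses UTF-16 surrogate pairs in wide strings. encode_codepoint is a
hypothetical name.

```c
#include <stddef.h>  /* size_t */
#include <wchar.h>   /* wchar_t, WCHAR_MAX */

/* Hypothetical helper: store the Unicode codepoint cp in out[] and return
   the number of wchar_t units written (1 or 2), or 0 if cp is not a valid
   Unicode scalar value.
   Assumption: wchar_t has at least 16 bits; when it cannot hold U+10FFFF,
   UTF-16 surrogate pairs are used. */
size_t encode_codepoint(unsigned long cp, wchar_t out[2])
{
    if (cp > 0x10FFFFul || (cp >= 0xD800ul && cp <= 0xDFFFul))
        return 0;                       /* out of range, or a lone surrogate */

    if (cp <= 0xFFFFul || WCHAR_MAX >= 0x10FFFF) {
        out[0] = (wchar_t)cp;           /* fits in a single code unit */
        return 1;
    }

    cp -= 0x10000ul;                    /* 20 bits left to split */
    out[0] = (wchar_t)(0xD800ul + (cp >> 10));      /* high surrogate */
    out[1] = (wchar_t)(0xDC00ul + (cp & 0x3FFul));  /* low surrogate */
    return 2;
}
```

With a 4-byte wchar_t, U+1F600 comes back as one unit; with a 2-byte wchar_t
the same call yields the pair 0xD83D, 0xDE00.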
Unlike Java's "char" type, which is an unsigned 16-bit integer on all
platforms, wchar_t has no standard size in C and C++...
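Since the size is not fixed, a program can only ask the implementation. The
sketch below reports the width of wchar_t and whether the C99 macro
__STDC_ISO_10646__ is defined; when it is, the implementation promises that
wchar_t values are ISO 10646 codepoints. The struct and function names are
hypothetical.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t */

/* Hypothetical record describing this implementation's wchar_t. */
struct wchar_info {
    unsigned bits;   /* width of wchar_t in bits */
    int iso10646;    /* 1 if __STDC_ISO_10646__ is defined, else 0 */
};

struct wchar_info describe_wchar(void)
{
    struct wchar_info info;

    info.bits = (unsigned)(sizeof(wchar_t) * CHAR_BIT);
#ifdef __STDC_ISO_10646__
    info.iso10646 = 1;   /* wchar_t values are ISO 10646 codepoints */
#else
    info.iso10646 = 0;   /* no such promise; the encoding may vary by locale */
#endif
    return info;
}
```

Printing both fields on each target platform shows quickly whether wide
strings can be treated as UTF-32 there or need the locale-aware conversion
functions.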
----- Original Message -----
From: Rick Cameron
To: unicode@unicode.org
Sent: Monday, March 01, 2004 8:13 PM
Subject: What's in a wchar_t string on unix?
Hi, all
This may be an FAQ, but I couldn't find the answer on unicode.org.
It seems that most flavours of unix define wchar_t to be 4 bytes. If the locale
is set to be Unicode, what's in a wchar_t string? Is it UTF-32, or UTF-16 with
the code units zero-extended to 4 bytes?
Cheers
- rick cameron