Marco.Cimarosti@icl.com wrote:
>
> Antoine Leca wrote:
> > char C_thai[] =
> > "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
>
> Would the Unicode values be converted to the local SBCS/MBCS character set?
In this case, yes (assuming a normal C compiler).
With wchar_t / L"...", they are converted to the local "wide character set",
which happens to be Unicode on most boxes, with the following main exceptions:
- some (cheap) C compilers do not have any special support for wchar_t,
so it defaults to the same type as char, and is usually 8 bits wide;
- with East Asian C compilers, wchar_t is either Unicode or
a flat character coding, in which every character, whether coded as SBCS or DBCS,
is stored with its nominal, legacy code in a 16-bit or 32-bit cell
(this differs from MBCS in that the ASCII characters are stored
in cells of the same width as the DBCS characters);
- EBCDIC implementations have their own rules (for obvious reasons), which
I do not know exactly (I am not sure they are consistent).
C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t
values are the Unicode code points (to find out whether that means UTF-16 or
UTF-32, look at WCHAR_MAX, which tells whether wchar_t is 16-bit or 32-bit).
> If yes:
>
> Is the definition of this locale info part of the C99 standard itself, or is
> it operating system's locale?
It is "implementation-defined". Which means:
- it is not required in any way by the C99 Standard itself (except if
__STDC_ISO_10646__ is defined);
- it is required to be spelled out in the documentation for the compiler;
- it can vary with compilation options; often the OS's current locale is
the default value, which can be overridden.
> And what happens to Unicode values that cannot be converted in that
> character set?
The compiler is required to fall back to something (it cannot refuse to
compile, nor can it simply drop the character); it is allowed to "fall back"
to different characters depending on the typed character, though; so for example,
#include <stdio.h>
int main() { printf("%ls\n", L"\u00C0 table!"); return 0; }
can produce, among others (shown here UTF-8 encoded):
À table!
A table!
à table!
table!
I could go on about this subject (all of this should eventually end up
in a FAQ anyway), but I do not want to flood the list with a marginally
interesting subject.
Antoine
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:06 EDT