Re: Unicode String literals on various

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Tue Aug 08 2000 - 13:10:59 EDT


Marco.Cimarosti@icl.com wrote:
>
> Antoine Leca wrote:
> > char C_thai[] =
> > "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
>
> Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler).

With wchar_t / L"...", they are converted to the local "wide character set",
which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t,
 so it defaults to the same type as char, and is usually 8 bits wide;

- with East Asian C compilers, wchar_t is either Unicode or
 a flat character coding, in which every character, whether coded as SBCS or
 DBCS, is stored with its nominal legacy code in a 16-bit or 32-bit cell
 (this differs from MBCS in that the ASCII characters are stored
 in cells of the same width as the DBCS characters);

- EBCDIC implementations have their own rules (for obvious reasons), which
 I do not know exactly (I am not sure they are consistent).

C99 also specifies that if __STDC_ISO_10646__ is defined, then wchar_t
values are the Unicode code points (to learn whether that means UTF-16 or
UTF-32, check WCHAR_MAX to see whether wchar_t is 16 or 32 bits wide).

 
> If yes:
>
> Is the definition of this locale info part of the C99 standard itself, or is
> it operating system's locale?

It is "implementation-defined". Which means:
- it is not required in any way by the C99 Standard itself (except if
 __STDC_ISO_10646__ is defined);
- it must be stated explicitly in the documentation for the compiler;
- it can vary with compilation options; often the OS's current locale is
 the default value, which can be overridden.

 
> And what happens to Unicode values that cannot be converted in that
> character set?

The compiler is required to fall back to something (it cannot refuse to
compile, nor can it simply drop the character); it is allowed to "fall back"
to a different character depending on the typed character, though; so for
example,

  #include <stdio.h>
  int main() { printf("%ls\n", L"\u00C0 table!"); return 0; }

can produce, among others (shown here UTF-8 encoded):

À table!
A table!
à table!
 table!

I could go on about this subject (all of this should end up in a FAQ
anyway), but I do not want to flood the list with a marginally
interesting subject.

Antoine


