Re: Unicode String literals on various platforms

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Tue Aug 08 2000 - 08:55:16 EDT


Bob Jones wrote:
>
> In a C program, how do you code Unicode string literals on the following
> platforms:
> NT
> Unix (Sun, AIX, HP-UX)
> AS/400

We devised a solution for this problem in the C99 Standard.
The "solution" is named "UCN", for Universal Character Notation, and
is essentially to use the (borrowed from Java) \uxxxx notation, like
(with Ken's example)

  char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

And similarly, wchar_t C_thai[] = L"\u0E40... or
               TCHAR C_thai[] = _T("\u0E40...
depending on your storage choice. See below for more.
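
(For reference, a minimal sketch of how Microsoft's TCHAR mechanism plays
with UCNs; it assumes <tchar.h>, a Windows compiler, and _UNICODE being
defined or not at build time:)

  #include <tchar.h>   /* Windows-only header: defines TCHAR and _T() */

  /* With _UNICODE defined, TCHAR is wchar_t and _T("...") expands to a
     wide literal L"..."; without it, TCHAR is char and the literal stays
     narrow. The same source line thus builds both ways. */
  TCHAR C_thai[] = _T("\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33");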

The benefit is that now your C program is portable to any platform
where the C compiler complies with C99.
The drawback is that, nowadays, there are very few such compilers.

 
> Everything I have read says not to use wchar_t for cross platform apps
> because the size is not uniform, i.e. NT it is an unsigned short (2 bytes)
> while on Unix it is an unsigned int (4 bytes). If you create your own TCHAR
> or whatever, how do you handle string literals?

A similar problem exists with numbers, doesn't it? And the usual solution
is to *not* exchange data in internal format, but rather to use textual
representations. Agreed?

For a C _program_, where the textual representations are string literals
(rather than arrays of integers), C99 UCNs are the way to go.

Now, since you are talking about wchar_t vs. other ways of storing characters,
I wonder if you are not really asking about the manipulated _data_,
as opposed to the C program itself.

Then, I believe the solution is exactly the same as with numbers: internally,
use whatever is most appropriate for the current platform (the TCHAR/_T()
solution from Microsoft is nice because it conveniently switches to either
char or wchar_t depending on the compilation options), but when exchanging
data, convert to a common, textual representation.

Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to
output/input wide characters to/from text files. Another solution is to use
"Unicode" files, with some dedicated conversions, pretty much the same as
using the htons(), ntohl(), etc. functions when dealing with low-level
Internet protocols.
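
To illustrate, here is a minimal sketch (assuming a C99 compiler; the file
name "thai.txt" is mine) of writing a wide string to a text file with %ls;
the multibyte encoding used in the file is whatever the user's locale says:

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main(void)
  {
      wchar_t C_thai[] = L"\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
      FILE *fp;

      /* Pick up the multibyte encoding from the environment. */
      setlocale(LC_ALL, "");

      fp = fopen("thai.txt", "w");
      if (fp != NULL) {
          /* %ls converts the wide string to the locale's multibyte
             encoding on output; fwscanf()/%ls does the reverse on input. */
          fprintf(fp, "%ls\n", C_thai);
          fclose(fp);
      }
      return 0;
  }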

I agree that the C Standard currently lacks a way to indicate that a text
file should be opened with a specific encoding (e.g. UTF-16LE/BE or UTF-8).
And the discussions on this matter have been endless so far.
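
In the meantime one can roll one's own conversion, in the spirit of htons()
and ntohl(). A hypothetical helper (the name write_utf16le is mine), which
assumes every character fits in the BMP and that the file was opened in
binary mode, could look like:

  #include <stdio.h>
  #include <wchar.h>

  /* Emit each wide character as two bytes, low byte first (UTF-16LE),
     independently of the platform's wchar_t size or byte order.
     No surrogate handling: only characters below U+10000 are covered. */
  static void write_utf16le(FILE *fp, const wchar_t *s)
  {
      for (; *s != L'\0'; s++) {
          unsigned long c = (unsigned long)*s;
          putc((int)(c & 0xFF), fp);          /* low byte */
          putc((int)((c >> 8) & 0xFF), fp);   /* high byte */
      }
  }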

> On NT L"foobar" gives each character 2 bytes,

Yes

> but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. On some it is 4 bytes, on some it is 8 (64-bit
boxes), and on some it is even only 8 bits wide (and those are not Unicode
compliant).

> Even worse I suspect is the AS/400 where the string literal is probably in
> EBCDIC.

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C),
but what is the problem? You are not going to memcpy() L"foobar", or
fwrite() it, are you? And I am sure your AS/400 implementation has
some way to specify on open() that a text file is really an "ASCII" file,
rather than EBCDIC. Or if it does not, it should...

Regards,
Antoine


