Re: What's in a wchar_t string on unix?

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Mar 04 2004 - 05:11:28 EST

Next message: Edward H. Trager: "RE: SVG Fonts - Is it the Font Standard of the future?"

Previous message: Ernest Cline: "RE: SVG Fonts - Is it the Font Standard of the future?"
In reply to: Peter Kirk: "Re: What's in a wchar_t string on unix?"
Next in thread: Rick Cameron: "RE: What's in a wchar_t string on unix?"

On Wednesday, March 03, 2004 11:22 PM Peter Kirk wrote:

>>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is
>>> defined? or does it only mean wchar_t hold the character in
>>> ISO_10646 (which mean it could be 2 bytes, 4 bytes or more than
>>> that?)
>>
> On 03/03/2004 11:27, Antoine Leca wrote:
>
>> The latter. But if wchar_t is 16 bits, it can only encode Unicode 3.0
>> or before, i.e. no UTF-16 support.
>>
> Surely if wchar_t is 16 bits, it CAN be used to encode the whole of
> Unicode with UTF-16, i.e. with supplementary plane characters
> represented as "surrogate pairs" in pairs of wchar_t.

OK, right, the programmer CAN put whatever she wants into a wchar_t (or an
unsigned short, for that matter).

I was speaking about what the compiler+libc was expecting to find and to
handle correctly. Sorry for the inexact words.

> Whether these
> characters SHOULD be represented as UTF-16 code units in a wchar_t
> string (or whether representation should be either UCS-2 or UTF-32)
> is a separate issue, probably related to how the associated libraries
> handle the code units for surrogates.

And also to the level of support the compiler offers for the \U00xxxxxx
notation.

As I wrote in other posts, an otherwise compliant compiler,
- using 16-bit wchar_t, and
- defining __STDC_ISO_10646__ to something (which should be less
  than 200111L, the date of publication of ISO/IEC 10646-2:2001,
  the first one that defined the use of the planes outside the BMP)
cannot conformingly interpret the \U00xxyyyy notation in an L"" string
constant if xx is not 00, because it would then fail to conform to the
requirement that any character should be represented in a single wchar_t
(more exactly, it can do it, but should emit some warning, because the
character does not fit into one wchar_t).
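
To make the two conditions above concrete, here is a minimal C sketch that
reports what an implementation advertises. The helper names
fits_in_one_wchar and iso10646_level are made up for this example; only
WCHAR_MAX and __STDC_ISO_10646__ come from the standard.

```c
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/* Hypothetical helper: returns 1 if the code point cp can be stored in a
   single wchar_t on this implementation, 0 otherwise. */
int fits_in_one_wchar(unsigned long cp)
{
    return cp <= (unsigned long)WCHAR_MAX;
}

/* Hypothetical helper: returns the value of __STDC_ISO_10646__ (a yyyymmL
   date), or 0 if the implementation does not define the macro. */
long iso10646_level(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

With a 16-bit wchar_t, fits_in_one_wchar(0x10400UL) would be expected to
return 0, which is exactly the case where \U00010400 cannot be a single
wide character.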

I usually say then that a compiler with 16-bit wchar_t can only encode
UCS-2, not UTF-16. In other words, the management of UTF-16, such as keeping
together the pair of surrogates, pairing them when transcoding to something
else such as UTF-8, etc., should be done by the user (or externally provided
libraries, obviously), because there is no way to tell whether the standard
library does it or not.
That said, it CAN be done, as Peter rightly said. And the rest of the job,
that is, the handling of BMP codepoints, can be left to the compiler/system
libraries, thanks to the support advertised by the #definition of
__STDC_ISO_10646__.
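
As a rough sketch of the user-side work described above (pairing surrogates
and transcoding to UTF-8), something like the following could be used. All
function names here are hypothetical, not part of any standard library, and
the code units are held in unsigned short rather than wchar_t so that the
sketch does not depend on the width of wchar_t.

```c
#include <stddef.h>

/* Hypothetical helpers; none of these names exist in the C library. */
int is_high_surrogate(unsigned int u) { return u >= 0xD800u && u <= 0xDBFFu; }
int is_low_surrogate(unsigned int u)  { return u >= 0xDC00u && u <= 0xDFFFu; }

/* Combine a valid high/low surrogate pair into one code point. */
unsigned long combine_surrogates(unsigned int hi, unsigned int lo)
{
    return 0x10000UL + ((unsigned long)(hi - 0xD800u) << 10) + (lo - 0xDC00u);
}

/* Encode one code point as UTF-8 into out (room for 4 bytes needed).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                       /* lone surrogate value */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Transcode n UTF-16 code units to UTF-8. Returns the number of bytes
   written, or (size_t)-1 on an unpaired surrogate or a too-small buffer. */
size_t utf16_to_utf8(const unsigned short *in, size_t n,
                     unsigned char *out, size_t cap)
{
    size_t i = 0, w = 0;
    while (i < n) {
        unsigned long cp = in[i++];
        unsigned char buf[4];
        size_t len, k;

        if (is_high_surrogate((unsigned int)cp)) {
            if (i == n || !is_low_surrogate(in[i]))
                return (size_t)-1;          /* high surrogate without partner */
            cp = combine_surrogates((unsigned int)cp, in[i++]);
        } else if (is_low_surrogate((unsigned int)cp)) {
            return (size_t)-1;              /* stray low surrogate */
        }
        len = utf8_encode(cp, buf);
        if (len == 0 || w + len > cap)
            return (size_t)-1;
        for (k = 0; k < len; k++)
            out[w++] = buf[k];
    }
    return w;
}
```

For instance, U+10400 is the pair 0xD801 0xDC00 in UTF-16, and the sketch
turns it into the four bytes F0 90 90 80.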

On the other hand, a (hypothetical, as Nelson showed) compiler/library that
defines __STDC_ISO_10646__ to be 200111L (and provides 32-bit or wider
wchar_t, of course) does assure that all the handling of the surrogates is
done correctly by the standard library and associated support. As such,
iswupper(L'\U00010400') (DESERET CAPITAL LETTER LONG I) should not return 0.
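
A small probe for that last expectation could look like the sketch below.
The function name deseret_long_i_is_upper is invented for this example. It
uses the numeric value 0x10400 instead of L'\U00010400' so that it still
compiles where wchar_t is 16 bits, and it assumes the caller has already
selected a suitable locale with setlocale(), since the result of iswupper()
depends on LC_CTYPE.

```c
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>
#include <wctype.h>

/* Hypothetical probe: returns 1 if iswupper() classifies U+10400 as
   upper case, 0 if it does not, and -1 if the question cannot be asked
   (macro not defined, or wchar_t too narrow to hold the code point). */
int deseret_long_i_is_upper(void)
{
#if defined(__STDC_ISO_10646__) && WCHAR_MAX >= 0x10400
    /* With __STDC_ISO_10646__ defined, wchar_t values are code points. */
    return iswupper((wint_t)0x10400) != 0;
#else
    return -1;
#endif
}
```

Which of the three values comes back depends on the implementation and on
the locale in effect, so no particular result is claimed here.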

Antoine

