From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Dec 12 2003 - 07:45:05 EST
Tim Greenwood wrote:
> In my interpretation of the C standard (which I am reading from
> http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.pdf) UTF-8 is not a
> valid wchar_t encoding if your execution character set contains
> characters outside the C0 controls and Basic Latin range, and
> UTF-16 is not a valid wchar_t encoding if your execution character
> set has characters outside the BMP. In other words whatever you
> consider to be a character (which may be a combining character)
> must be encoded in one wchar_t code unit.
>
> The relevant passage is
>
> 11 A wide character constant has type wchar_t, an integer
> type defined in the <stddef.h> header. The value of a wide character
> constant containing a single multibyte character that maps to a
> member of the extended execution character set is the wide
> character (code) corresponding to that multibyte character, as
> defined by the mbtowc function, with an implementation-defined
> current locale. The value of a wide character constant containing
> more than one multibyte character, or containing a multibyte
> character or escape sequence not represented in the extended
> execution character set, is implementation-defined.
I don't know. I thought a bit about this, and I think that your restrictive
interpretation is not necessarily correct.
After all, the C Standard just says that a "wide character" and a
"multibyte character" are whatever the <mbtowc> function defines them to be.
And it is quite easy to show that the <mbtowc> function could, in turn,
define them to be whatever the <mbrtowc> function defines them to be:
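// My hypothetical "mbtowc.c"
#include <stdlib.h> // declares mbtowc
#include <wchar.h>  // declares mbrtowc and mbstate_t
// (See ISO/IEC 9899:1999 - 7.20.7.2 "The mbtowc function")
int mbtowc (wchar_t * pwc, const char * s, size_t n)
{
    static mbstate_t internal;
    size_t len;
    if (s == NULL)
    {
        // a null s resets the conversion state (7.20.7, paragraph 1)
        mbrtowc(NULL, "", 1, &internal);
        // yes: we are stateful (or pretend we are)
        return 1;
    }
    len = mbrtowc(pwc, s, n, &internal);
    if (len == (size_t)-1 || len == (size_t)-2)
    {
        // invalid or incomplete sequence: mbtowc reports both as -1
        return -1;
    }
    return (int)len;
}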
As the definition of multibyte characters and wide characters is now
completely up to <mbrtowc>, we could well adopt the convention (or call
it a "trick", if you prefer) of pretending that a 4-byte UTF-8 multibyte
sequence is actually a sequence of *two* 2-byte multibyte sequences.
Technically, the trick is possible because:
a) returning 2 twice instead of 4 once guarantees the correct
advance while scanning a string;
b) we can actually map both our fake 2-byte multibyte sequences to
an actual "wide character": the high and low surrogates;
c) the <mbstate_t> object can be used to store the relevant data
across the two calls.
Legally, the trick is possible because of the purposely vague wording of the
C Standard, which leaves the definition of wide and multibyte characters
completely up to the implementation.
Here is what I mean:
// Excerpt from my hypothetical <wchar.h> for UTF-16 wide characters
// ...
// (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
typedef unsigned short wchar_t; // unsigned, so surrogates (0xD800-0xDFFF) fit
// ...
// (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character
// utilities <wchar.h>")
typedef wchar_t mbstate_t;
// ...
// My hypothetical "mbrtowc.c" for UTF-16 wide characters
#include <wchar.h>
// (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
{
    // helpers assumed to be defined elsewhere in the library
    extern int _MyDecodeUtf8 (const char * s, size_t n, long * c32);
    extern void _MyEncodeUtf16 (long c32, wchar_t * hi16, wchar_t * lo16);
    static mbstate_t internal = 0;
    wchar_t hi16;
    long c32;
    int retval;
    if (ps == NULL)
    {
        ps = &internal;
    }
    if (s == NULL)
    {
        pwc = NULL;
        s = "";
        n = 1;
    }
    if (*ps != 0)
    {
        if (pwc != NULL)
        {
            // output second surrogate saved in previous call
            *pwc = *ps;
        }
        // clear saved surrogate
        *ps = 0;
        // return fake multibyte length
        return 2;
    }
    retval = _MyDecodeUtf8(s, n, &c32);
    if (retval == 4)
    {
        // save second surrogate for next call
        _MyEncodeUtf16(c32, &hi16, ps);
        if (pwc != NULL)
        {
            // output first surrogate (pwc may legitimately be NULL)
            *pwc = hi16;
        }
        // return fake multibyte length
        retval = 2;
    }
    else if (retval >= 0 && pwc != NULL)
    {
        *pwc = (wchar_t)c32;
    }
    return (size_t)retval;
}
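The two helpers declared extern above are left undefined in this message. A minimal sketch of what they might look like follows. Everything in it is an assumption: the names my_decode_utf8 and my_encode_utf16 (chosen to stay out of the reserved-identifier space), the use of unsigned short in place of the hypothetical wchar_t, and the return convention (sequence length, 0 for the null character, -1 for an invalid sequence, -2 for an incomplete one), which is picked so that the values line up with what mbrtowc is expected to return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for _MyDecodeUtf8: decodes one UTF-8 sequence from
   at most n bytes of s into *c32. Returns the sequence length (0 for the
   terminating null), -2 if the sequence is incomplete, -1 if it is invalid. */
int my_decode_utf8(const char *s, size_t n, long *c32)
{
    const unsigned char *p = (const unsigned char *)s;
    long c, min;
    int len, i;

    if (n == 0) return -2;
    if (p[0] < 0x80)                { c = p[0];        len = 1; min = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; min = 0x10000L; }
    else return -1;                 /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((size_t)i >= n) return -2;          /* ran out of input */
        if ((p[i] & 0xC0) != 0x80) return -1;   /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* reject overlong forms, encoded surrogates, and values past U+10FFFF */
    if (c < min || c > 0x10FFFFL || (c >= 0xD800 && c <= 0xDFFF)) return -1;

    *c32 = c;
    return (c == 0) ? 0 : len;
}

/* Hypothetical stand-in for _MyEncodeUtf16: splits a supplementary code
   point (U+10000..U+10FFFF) into its high and low surrogates. */
void my_encode_utf16(long c32, unsigned short *hi16, unsigned short *lo16)
{
    long v;

    assert(c32 >= 0x10000L && c32 <= 0x10FFFFL);
    v = c32 - 0x10000L;
    *hi16 = (unsigned short)(0xD800 + (v >> 10));   /* top 10 bits */
    *lo16 = (unsigned short)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits */
}
```

Since only 4-byte sequences decode to code points above U+FFFF, the "retval == 4" test in the mbrtowc above is enough to decide when a surrogate pair is needed.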
If the above UTF-16 implementation could perhaps look relatively "smart", a
UTF-8 implementation would definitely look very silly.
However, if we agree that defining what a "multibyte character" and a
"wide character" are is the exclusive task of <mbtowc> (and hence of
<mbrtowc>), then the implementation below, silly as it is, could well be
100% compliant with C99:
// Excerpt from my hypothetical <wchar.h> for UTF-8 (or DBCS, or SBCS, or
// any byte-oriented encoding) wide characters
// ...
// (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
typedef char wchar_t;
// ...
// (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character
// utilities <wchar.h>")
typedef wchar_t mbstate_t;
// ...
// My hypothetical "mbrtowc.c" with UTF-8 (or DBCS, or ...) wide
// characters
#include <wchar.h>
// (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
{
    // (ps is unused: every byte is a complete "multibyte character")
    (void)ps;
    if (s == NULL)
    {
        pwc = NULL;
        s = "";
        n = 1;
    }
    if (n < 1)
    {
        // no bytes to examine: report an incomplete sequence
        return (size_t)-2;
    }
    if (pwc != NULL)
    {
        *pwc = *s;
    }
    return (*s == 0) ? 0 : 1;
}
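From the caller's side, none of this is visible: a conforming program simply trusts the value that mbrtowc returns as the number of bytes to skip. As a rough sketch of that caller's view, the loop below counts "wide characters" with whatever mbrtowc the C library provides. The function name count_wide_chars is hypothetical, and the code assumes the default "C" locale unless the program has called setlocale.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Counts the wide characters that the C library's mbrtowc reports for the
   string s, or returns (size_t)-1 if a sequence is invalid or incomplete.
   count_wide_chars is a hypothetical helper name. */
size_t count_wide_chars(const char *s)
{
    mbstate_t state;
    size_t remaining, count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);    /* initial conversion state */
    remaining = strlen(s) + 1;          /* let mbrtowc see the terminator */

    for (;;) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)                   /* reached the terminating null */
            return count;
        s += len;                       /* trust the reported advance */
        remaining -= len;
        count++;
    }
}
```

Run against the UTF-16 implementation above, this loop would count a 4-byte UTF-8 sequence as two wide characters (one per surrogate); run against the byte-oriented one, it would count every byte. In both cases the loop itself stays the same, which is the point of the argument.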
_ Marco