[OT?] The C standard library and UTF's (was RE: Text Editors and Canonical Equivalence (was Coloured diacritics))

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Dec 12 2003 - 07:45:05 EST


    Tim Greenwood wrote:
    > In my interpretation of the C standard (which I am reading from
    > http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.pdf) UTF-8 is not a
    > valid wchar_t encoding if your execution character set contains
    > characters outside the C0 controls and Basic Latin range, and
    > UTF-16 is not a valid wchar_t encoding if your execution character
    > set has characters outside the BMP. In other words whatever you
    > consider to be a character (which may be a combining character)
    > must be encoded in one wchar_t code unit.
    >
    > The relevant passage is
    >
    > 11 A wide character constant has type wchar_t, an integer
    > type defined in the <stddef.h> header. The value of a wide character
    > constant containing a single multibyte character that maps to a
    > member of the extended execution character set is the wide
    > character (code) corresponding to that multibyte character, as
    > defined by the mbtowc function, with an implementation-defined
    > current locale. The value of a wide character constant containing
    > more than one multibyte character, or containing a multibyte
    > character or escape sequence not represented in the extended
    > execution character set, is implementation-defined.

    I don't know. I thought a bit about this, and I think that your restrictive
    interpretation is not necessarily correct.

    After all, all the C Standard says is that a "wide character" and a
    "multibyte character" are whatever the <mbtowc> function defines them to be.

    And it is quite easy to show that the <mbtowc> function could, in turn,
    define them to be whatever the <mbrtowc> function defines them to be:

       // My hypothetical "mbtowc.c"
       #include <stdlib.h>
       #include <wchar.h>
       // (See ISO/IEC 9899:1999 - 7.20.7.2 "The mbtowc function")
       int mbtowc (wchar_t * pwc, const char * s, size_t n)
       {
          size_t retval;
          static mbstate_t internal;
          if (s == NULL)
          {
             // yes: we are stateful (or pretend we are)
             return 1;
          }
          retval = mbrtowc(pwc, s, n, &internal);
          if (retval == (size_t)-1 || retval == (size_t)-2)
          {
             // invalid or incomplete sequence: mbtowc reports both as -1
             return -1;
          }
          return (int)retval;
       }
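
    Seen from the caller's side, this is all a program ever learns about
    where one "multibyte character" ends and the next begins. The following
    is only a sketch of a typical scanning loop; the helper name
    count_mb_chars is made up for illustration and is not part of any
    standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not a standard function): counts the "multibyte
// characters" in s, as delimited by mbtowc in the current locale.
// Returns -1 if mbtowc rejects a sequence.
long count_mb_chars (const char * s)
{
   size_t left;
   long count = 0;
   assert(s != NULL);
   left = strlen(s);
   // put mbtowc back into its initial shift state
   (void)mbtowc(NULL, NULL, 0);
   while (left > 0)
   {
      wchar_t wc;
      int len = mbtowc(&wc, s, left);
      if (len <= 0)
      {
         // -1: invalid sequence; 0 is not expected before the terminator
         return -1;
      }
      // the library, not the caller, decides how far to advance
      s += len;
      left -= (size_t)len;
      count++;
   }
   return count;
}
```

    In the default "C" locale, count_mb_chars("abc") yields 3. What it
    yields for non-ASCII input depends entirely on what the implementation
    decides to call a character.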

    As the definition of multibyte characters and wide characters is now
    completely up to <mbrtowc>, we could well adopt the convention (or call
    it a "trick", if you prefer) of pretending that a 4-byte UTF-8 multibyte
    sequence is actually a sequence of *two* 2-byte multibyte sequences.

    Technically, the trick is possible because:

            a) returning 2 twice instead of 4 once guarantees the correct
    advance while scanning a string;
            b) we can actually map both our fake 2-byte multibyte sequences to
    an actual "wide character": the high and low surrogates;
            c) the <mbstate_t> object can be used to store the relevant data
    across the two calls.
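
    For point b), the arithmetic behind the two surrogates can be sketched
    as follows. The function name split_surrogates is hypothetical, and
    unsigned short is assumed to be (at least) 16 bits wide. Note that a low
    surrogate is never zero, which is what would allow a saved surrogate to
    double as the "something is pending" flag in point c).

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: splits a supplementary code point
// (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
void split_surrogates (long c32, unsigned short * hi16, unsigned short * lo16)
{
   long v;
   assert(c32 >= 0x10000L && c32 <= 0x10FFFFL);
   assert(hi16 != NULL && lo16 != NULL);
   v = c32 - 0x10000L;                              // 20 significant bits
   *hi16 = (unsigned short)(0xD800 + (v >> 10));    // top 10 bits
   *lo16 = (unsigned short)(0xDC00 + (v & 0x3FF));  // bottom 10 bits
}
```

    For example, U+1F600 splits into 0xD83D and 0xDE00.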

    Legally, the trick is possible because of the purposely vague wording of the
    C Standard, which leaves the definition of wide and multibyte characters
    completely up to the implementation.

    Here is what I mean:

       // Excerpt from my hypothetical <wchar.h> for UTF-16 wide characters
       // ...
       // (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
       // (unsigned, so that surrogates 0xD800..0xDFFF fit in 16 bits)
       typedef unsigned short wchar_t;
       // ...
       // (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character
       // utilities <wchar.h>")
       typedef wchar_t mbstate_t;
       // ...

       // My hypothetical "mbrtowc.c" for UTF-16 wide characters
       #include <wchar.h>
       // (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
       size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
       {
          extern int _MyDecodeUtf8 (const char * s, size_t n, long * c32);
          extern void _MyEncodeUtf16 (long c32, wchar_t * hi16, wchar_t * lo16);
          static mbstate_t internal = 0;
          long c32;
          int retval;
          if (ps == NULL)
          {
             ps = &internal;
          }
          if (s == NULL)
          {
             pwc = NULL;
             s = "";
             n = 1;
          }
          if (*ps != 0)
          {
             if (pwc != NULL)
             {
                // output second surrogate saved in previous call
                *pwc = *ps;
             }
             // clear saved surrogate
             *ps = 0;
             // return fake multibyte length
             return 2;
          }
          retval = _MyDecodeUtf8(s, n, &c32);
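          if (retval == 4)
          {
             // output first surrogate and save second surrogate for next call
             // (via a scratch variable, so that a null pwc is never written to)
             wchar_t hi16;
             _MyEncodeUtf16(c32, &hi16, ps);
             if (pwc != NULL)
             {
                *pwc = hi16;
             }
             // return fake multibyte length
             retval = 2;
          }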
          else if (retval >= 0 && pwc != NULL)
          {
             *pwc = (wchar_t)c32;
          }
          return retval;
       }
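
    To see the "2 twice" behaviour in action without replacing the real
    library, the same control flow can be rebuilt under different names.
    Everything prefixed with demo_ below is hypothetical: demo_decode_utf8
    is only a simplified stand-in for _MyDecodeUtf8 (whose real definition
    is not shown in this message), and demo_wchar stands in for a 16-bit
    wchar_t.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short demo_wchar;    // stand-in for a 16-bit wchar_t
typedef demo_wchar demo_mbstate;      // pending low surrogate, or 0

// Stand-in for _MyDecodeUtf8: returns the sequence length (0 for the null
// character), -1 if invalid, -2 if truncated. Simplified: most overlong
// forms and encoded surrogates are not rejected.
static int demo_decode_utf8 (const char * s, size_t n, long * c32)
{
   const unsigned char * p = (const unsigned char *)s;
   int len, i;
   long c;
   if (n < 1) return -2;
   if (p[0] < 0x80) { *c32 = p[0]; return p[0] != 0; }
   else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
   else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
   else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
   else return -1;
   for (i = 1; i < len; i++)
   {
      if ((size_t)i >= n) return -2;
      if ((p[i] & 0xC0) != 0x80) return -1;
      c = (c << 6) | (p[i] & 0x3F);
   }
   // a 4-byte sequence must encode a supplementary code point
   if (c > 0x10FFFFL || (len == 4 && c < 0x10000L)) return -1;
   *c32 = c;
   return len;
}

// Same control flow as the hypothetical mbrtowc above.
size_t demo_mbrtowc16 (demo_wchar * pwc, const char * s, size_t n,
                       demo_mbstate * ps)
{
   long c32;
   int retval;
   assert(ps != NULL);
   if (*ps != 0)
   {
      // second call: deliver the saved low surrogate
      if (pwc != NULL) *pwc = *ps;
      *ps = 0;
      return 2;
   }
   retval = demo_decode_utf8(s, n, &c32);
   if (retval == 4)
   {
      // first call: deliver the high surrogate, save the low one
      long v = c32 - 0x10000L;
      if (pwc != NULL) *pwc = (demo_wchar)(0xD800 + (v >> 10));
      *ps = (demo_wchar)(0xDC00 + (v & 0x3FF));
      return 2;
   }
   if (retval >= 0 && pwc != NULL) *pwc = (demo_wchar)c32;
   return (size_t)retval;
}

// Converts n bytes of UTF-8; returns the number of 16-bit units written,
// or (size_t)-1 on a decoding error or if out is too small.
size_t demo_convert (const char * s, size_t n, demo_wchar * out, size_t cap)
{
   demo_mbstate st = 0;
   size_t count = 0;
   while (n > 0 || st != 0)
   {
      demo_wchar wc;
      size_t len = demo_mbrtowc16(&wc, s, n, &st);
      if (len == (size_t)-1 || len == (size_t)-2 || count >= cap)
         return (size_t)-1;
      out[count++] = wc;
      if (len == 0) len = 1;   // an embedded null occupies one byte
      s += len;
      n -= len;
   }
   return count;
}
```

    Feeding the four bytes F0 9F 98 80 (U+1F600) to demo_mbrtowc16 yields
    0xD83D with a return value of 2, and the next call yields 0xDE00 with
    another 2, so a caller that simply adds up the return values still
    advances by the correct 4 bytes.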

    While the above UTF-16 implementation may look relatively "smart", a
    UTF-8 implementation would definitely look very silly.

    However, if we agree that defining what a "multibyte character" and a
    "wide character" are is the exclusive task of <mbtowc> (and hence of
    <mbrtowc>), then the implementation below, silly as it is, could well be
    100% compliant with C99:

       // Excerpt from my hypothetical <wchar.h> for UTF-8 (or DBCS, or SBCS, or
       // any byte-oriented encoding) wide characters
       // ...
       // (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
       typedef char wchar_t;
       // ...
       // (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character
       // utilities <wchar.h>")
       typedef wchar_t mbstate_t;
       // ...

       // My hypothetical "mbrtowc.c" with UTF-8 (or DBCS, or ...) wide
       // characters
       #include <wchar.h>
       // (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
       size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
       {
          if (s == NULL)
          {
             pwc = NULL;
             s = "";
             n = 1;
          }
          if (n < 1)
          {
             // no bytes to examine: report an incomplete character
             return (size_t)-2;
          }
          if (pwc != NULL)
          {
             *pwc = *s;
          }
          // (ps is unused)
          (void)ps;
          return (*s == 0) ? 0 : 1;
       }
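
    The practical consequence of this byte-oriented reading can be sketched
    as follows: every byte becomes its own "wide character", so counting
    wide characters just counts bytes. The demo_ names are hypothetical and
    only exist so that the sketch can be compiled next to a real library.

```c
#include <assert.h>
#include <stddef.h>

// Same logic as the byte-oriented mbrtowc above; the "wide character"
// type is plain char and the state argument is dropped.
size_t demo_mbrtowc8 (char * pwc, const char * s, size_t n)
{
   if (n < 1) return (size_t)-2;
   if (pwc != NULL) *pwc = *s;
   return (*s == 0) ? 0 : 1;
}

// Counts "wide characters" up to (not including) the terminator.
size_t demo_count_wide (const char * s)
{
   size_t count = 0;
   size_t len;
   assert(s != NULL);
   while ((len = demo_mbrtowc8(NULL, s, 1)) == 1)
   {
      s += len;
      count++;
   }
   return count;
}
```

    Under this convention U+00E9, encoded in UTF-8 as the two bytes C3 A9,
    is reported as two "wide characters".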

    _ Marco


