Re: What's in a wchar_t string on unix?

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Mar 04 2004 - 05:11:28 EST


    On Wednesday, March 03, 2004 11:22 PM Peter Kirk wrote:

    >>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is
    >>> defined? or does it only mean wchar_t hold the character in
    >>> ISO_10646 (which mean it could be 2 bytes, 4 bytes or more than
    >>> that?)
    >>
    > On 03/03/2004 11:27, Antoine Leca wrote:
    >
    >> The latter. But if wchar_t is 16 bits, it can only encode Unicode 3.0
    >> or before, i.e. no UTF-16 support.
    >>
    > Surely if wchar_t is 16 bits, it CAN be used to encode the whole of
    > Unicode with UTF-16, i.e. with supplementary plane characters
    > represented as "surrogate pairs" in pairs of wchar_t.

    OK, right, the programmer CAN put whatever she wants into a wchar_t (or an
    unsigned short, for that matter).

    I was speaking about what the compiler and libc expect to find and to
    handle correctly. Sorry for the imprecise wording.

    > Whether these
    > characters SHOULD be represented as UTF-16 code units in a wchar_t
    > string (or whether representation should be either UCS-2 or UTF-32)
    > is a separate issue, probably related to how the associated libraries
    > handle the code units for surrogates.

    And also to the level of support the compiler offers for the \U00xxxxxx
    notation.

    As I wrote in other posts, an otherwise compliant compiler that
     - uses a 16-bit wchar_t, and
     - defines __STDC_ISO_10646__ to some value (which should be less
        than 200111L, the publication date of ISO/IEC 10646-2:2001, the
        first one to define the use of the supplementary planes)
    cannot conformingly interpret the \U00xxyyyy notation in an L"" string
    constant if xx is not 00, because it would then fail to conform to the
    requirement that any character be represented in a single wchar_t
    (more exactly, it can do it, but should emit some warning, because the
    character does not fit into one wchar_t).
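
    To make the two conditions above concrete, here is a minimal sketch of
    how a program might inspect them. The three helper names are hypothetical
    (they come from no library); only WCHAR_MAX and __STDC_ISO_10646__ are
    standard. The last helper merely restates the consistency rule argued
    above; it is not a check the C standard itself provides.

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical helper: 1 when a single wchar_t can hold every code point
   up to U+10FFFF, 0 otherwise (for example with a 16-bit wchar_t). */
static int wchar_holds_all_planes(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Hypothetical helper: the value of __STDC_ISO_10646__ (a yyyymmL date),
   or 0 when the implementation does not define the macro at all. */
static long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Hypothetical helper restating the argument above: an implementation
   whose wchar_t cannot hold the supplementary planes should not advertise
   a version of 200111L or later. */
static int iso10646_claim_is_consistent(void)
{
    return wchar_holds_all_planes() || iso10646_version() < 200111L;
}
```

    A program could call iso10646_claim_is_consistent() once at start-up and
    fall back to its own UTF-16 handling when wchar_holds_all_planes()
    returns 0.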

    I usually say, then, that a compiler with a 16-bit wchar_t can only encode
    UCS-2, not UTF-16. In other words, the management of UTF-16, such as keeping
    the pair of surrogates together, pairing them when transcoding to something
    else such as UTF-8, etc., should be done by the user (or by externally
    provided libraries, obviously), because there is no way to tell whether the
    standard library does it or not.
    That said, it CAN be done, as Peter rightly said. And the rest of the job,
    that is, the handling of BMP codepoints, can be left to the compiler/system
    libraries, thanks to the support advertised by the #definition of
    __STDC_ISO_10646__.
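
    As an illustration of the user-side work described above, here is a
    minimal sketch of surrogate pairing and of transcoding to UTF-8. It
    assumes uint16_t code units standing in for a 16-bit wchar_t, so that it
    behaves the same on every platform; the function names are hypothetical
    and belong to no standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of code units
   written (1 or 2), or 0 if cp is a surrogate or above U+10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* lone surrogate */
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate  */
    return 2;
}

/* Decode one code point from s[0..n-1]. Returns the number of code units
   consumed (1 or 2), or 0 on empty input or an unpaired surrogate. */
static size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    if (s[0] < 0xD800 || s[0] > 0xDFFF) { *cp = s[0]; return 1; }
    if (s[0] <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                         | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    return 0;                                     /* unpaired surrogate */
}

/* Encode one code point as UTF-8. Returns the number of bytes written
   (1 to 4), or 0 if cp is a surrogate or above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp > 0x10FFFF) return 0;
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

    Transcoding a UTF-16 buffer to UTF-8 is then a loop of utf16_decode()
    followed by utf8_encode(), stopping (or substituting U+FFFD, if that
    policy is preferred) whenever either function returns 0.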

    On the other hand, a (hypothetical, as Nelson showed) compiler/library that
    defines __STDC_ISO_10646__ to be 200111L (and provides a 32-bit or wider
    wchar_t, of course) does guarantee that all the handling of the surrogates
    is done correctly by the standard library and associated support. As such,
    iswupper(L'\U00010400') (DESERET CAPITAL LETTER LONG I) should not return 0.
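
    A minimal sketch of how one might probe this on a given system follows.
    The helper names are hypothetical. The answer depends on the C library
    and on the LC_CTYPE locale in effect, so the sketch reports what the
    library says rather than assuming a result.

```c
#include <locale.h>
#include <wchar.h>    /* wint_t, WCHAR_MAX */
#include <wctype.h>   /* iswupper */

/* Hypothetical helper: 1 if the library classifies c as an uppercase
   letter in the current LC_CTYPE locale, 0 otherwise. */
static int is_upper_wide(wint_t c)
{
    return iswupper(c) != 0;
}

/* Hypothetical helper: asks the library about U+10400 (DESERET CAPITAL
   LETTER LONG I). Returns 1 or 0 as reported by iswupper(), or -1 when a
   single wchar_t cannot hold a supplementary-plane character at all. */
static int deseret_long_i_is_upper(void)
{
#if WCHAR_MAX >= 0x10FFFF
    return is_upper_wide((wint_t)L'\U00010400');
#else
    return -1;   /* 16-bit wchar_t: the character does not fit */
#endif
}
```

    Calling setlocale(LC_CTYPE, "") before deseret_long_i_is_upper() selects
    the user's locale; in the default "C" locale a library is free to
    classify only the basic character set.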

    Antoine


