Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 14:46:54 CST


    On 2005/01/20 15:42, Antoine Leca at Antoine10646@Leca-Marti.org wrote:

    >> There cannot be more than one _standard_ library, i.e., a library
    >> as part of the issued C++ ISO/ANSI standard. :-)
    >
    > Even with this restriction: C++ on one side builds on top of the C standard,

    Actually, C++ is a wholly independent language, but with a C-like syntax,
    and the requirement that C++ code can be linked with C code. It is a complex
    issue to figure out how these two hang together.

    > so re-use the C notion of stream (stdio), which does have a wchar_t variant.
    > And then on the other side we have the iostreams well-known in C++ folklore
    > from Day 1, that I assume should also have a wchar_t facet.
    > :-)

    That already seems to have happened with GNU GCC, which fixes wchar_t at
    32 bits.
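
    As a rough illustration of the wchar_t width question, the C sketch below
    reports how wide wchar_t is under whatever compiler builds it. The helper
    names wchar_bits and wchar_holds_any_code_point are invented for this
    example. With GCC on typical GNU/Linux targets the width comes out as 32,
    but other toolchains (Windows ones, for instance) commonly use 16.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* wchar_t */
#include <wchar.h>    /* WCHAR_MAX */

/* Storage size of wchar_t in bits for the compiler building this file.
   (Hypothetical helper, not part of any standard library.) */
static unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if one wchar_t can represent values up to 0x10FFFF, the largest
   Unicode code point. This says nothing about which encoding is used. */
static int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

    Calling wchar_bits() from a small main() and printing the result with
    printf("%u\n", ...) is enough to compare compilers side by side.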

    >>> And I
    >>> happen to know very well that the use of wchar_t streams (using
    >>> the C meaning here, that is fwprintf etc.) is NOT widespread,
    >>> for a lot of reasons.
    >>
    >> In the past it has been so. But GNU GCC has now settled on using
    >> wchar_t as a 32-bit type. So that is probably where matters are
    >> heading.
    >
    > You got me wrong. Perhaps it is the direction a particular implementation is
    > heading. I am just saying USERS (programmers) are not there.

    Those things are not widespread. But in the past, GNU has often proved to
    be leading on new features. So it may yet come.

    >>>> Portability does not mean that the program is expected to run
    >>>> on different platforms without alterations, but merely tries
    >>>> to lessen those needed changes.
    >>>
    >>> You are certainly free to define portability the way you want.
    >>
    >> This is how one defines portability in the context of C/C++.
    >
    > If by "one" you mean yourself, we are in agreement.
    > Now, if you mean the general meaning, definitively no.

    It is a quote from BS (Bjarne Stroustrup, principal designer of C++)
    somewhere, I think, but I do not remember where. Perhaps it is in his
    "DEC++". Check it out in the C/C++ standards newsgroups.

    >> But when using a C/C++ compiler this is not so:
    >
    > Yes it is. The first step of the formal model of a C/C++ compiler (according
    > to both ISO standards) is to map the physical source characters into an
    > internal representation. So it is the job of the compiler vendor to actually
    > ensure of the similarity when it comes to C/C++ sources.

    The problem is that the underlying binary model differs from compiler to
    compiler, and there is no easy way to know that from the point of view of
    the language. People usually assume that there will be a specific type of
    padding, and often there is, but it may not be so. This is a topic for the
    C/C++ newsgroups.
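
    To make the padding point concrete, here is a small C sketch. The struct
    record and the two helper functions are hypothetical, chosen only for
    illustration. The language guarantees that value is placed after tag, but
    how many padding bytes sit between or after the members is left to the
    compiler, so the numbers can differ between implementations.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical record used only to show compiler-chosen padding. */
struct record {
    char tag;
    long value;
};

/* Offset of 'value' inside the struct; anything above sizeof(char)
   means the compiler inserted padding after 'tag'. */
static size_t value_offset(void)
{
    return offsetof(struct record, value);
}

/* Total padding bytes in the struct (between and after the members). */
static size_t record_padding(void)
{
    return sizeof(struct record) - (sizeof(char) + sizeof(long));
}
```

    Code that writes such a struct to a file with fwrite and reads it back
    under a different compiler depends on these two values being the same on
    both sides, which the language itself does not promise.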

    > This is exactly the same as relying on the HTTP server and client to pass
    > the HTML stream from the producer (the guy that wrote the page) to the user
    > (the browser).

    The HTTP protocol guarantees that the binary data comes out the same over
    the network. If you take an HTTP tool written in C, and compile it on
    different platforms, then it may not come out right because the C compilers
    may use different underlying binary models.
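
    One common way around this, sketched below in C, is to never send the
    in-memory representation directly but to write each value out octet by
    octet in a fixed order. The function names put_u32_be and get_u32_be are
    invented for this example, and it assumes the implementation provides
    uint32_t, which the C standard makes optional.

```c
#include <stdint.h>  /* uint32_t (optional in the C standard) */

/* Store v as four octets, most significant first, independent of the
   byte order and padding the compiler uses for uint32_t in memory. */
static void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild the value from four octets, most significant first. */
static uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)(in[0] & 0xFF) << 24) |
           ((uint32_t)(in[1] & 0xFF) << 16) |
           ((uint32_t)(in[2] & 0xFF) << 8)  |
            (uint32_t)(in[3] & 0xFF);
}
```

    The shifts operate on values, not on storage, so the octet sequence is the
    same whatever binary model the compiler has picked.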

    > And the C/C++ paradigm is to use textual data when communicating (which is
    > the framework targeted by Unicode).

    But only within the framework of each single compiler. In fact, sometimes
    even the different compilers on the same platform use different binary
    models, at least in the past. Then special efforts are required when object
    code from different compilers should be linked together. It is a pain when
    that happens, because the program just does not run properly, and one does
    not know why.

    > If you want more precise behaviour at
    > binary level, you probably should consider at least Posix instead, or
    > perhaps some ABI built on top of it.

    Most POSIX software is written using C. So it does not help.

    > And also restrict your low-level I/O to
    > unsigned char, C (so C++) has definitive provisions to ensure what you want
    > (or what you pretend to want) using them.

    There is no guarantee that these will be 8-bit bytes.
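
    A program that does rely on 8-bit bytes can at least check the assumption.
    The C sketch below uses CHAR_BIT and UCHAR_MAX from <limits.h>; the helper
    names bits_per_byte and byte_is_octet are made up for this example. The C
    standard only requires CHAR_BIT to be at least 8.

```c
#include <limits.h>  /* CHAR_BIT, UCHAR_MAX */

/* Number of bits in a char (and unsigned char) on this implementation. */
static int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Nonzero when unsigned char is exactly one octet, so that a buffer of
   unsigned char maps one-to-one onto octets in a file or on the wire. */
static int byte_is_octet(void)
{
    return CHAR_BIT == 8 && UCHAR_MAX == 0xFF;
}
```

    On mainstream desktop and server platforms byte_is_octet() returns 1, but
    that is a property of those platforms and not of the language.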

    And so on.

    >> There was a guy, a few years ago, giving an example.
    > ^^^^^^^^^^^^^^^
    > My guess is that proper compiler's support for \u was missing then.

    No, the support of \u... was appropriate according to the C++ standard
    because the C++ standard did not require anything special for it.
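
    A small sketch in C (C99 or later, where \u is also available) shows what
    is and is not pinned down. The helper names e_acute_value and
    wchar_is_iso10646 are invented for this example. The numeric value given
    to L'\u00E9' is only known to be the code point 0xE9 when the
    implementation defines __STDC_ISO_10646__; otherwise the mapping is up to
    the compiler.

```c
#include <stddef.h>  /* wchar_t */

/* Value the compiler assigned to the wide character literal for U+00E9
   (LATIN SMALL LETTER E WITH ACUTE). */
static unsigned long e_acute_value(void)
{
    return (unsigned long)L'\u00E9';
}

/* Nonzero if the implementation states that wchar_t values are
   ISO 10646 (Unicode) code points. */
static int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

    When wchar_is_iso10646() returns 0, a program cannot conclude anything
    about the result of e_acute_value() from the language standard alone.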

    > However, it is not a good point against a feature.

    The problem is not having such features, but that they are not sufficiently
    specific when putting requirements on the underlying binary model. This then
    causes problems when working with Unicode, unless the compiler writer has
    decided to fill in Unicode-friendly features in the absence of the standard
    defining them.

      Hans Aberg


