Re: 32'nd bit & UTF-8

From: Antoine Leca (
Date: Thu Jan 20 2005 - 08:42:38 CST


    On Thursday, January 20th, 2005 12:51Z Hans Aberg wrote:
    >>> C++ already has a standard library for wchar_t streams.
    >> Probably. I even guess there are more than one,
    > There cannot be more than one _standard_ library, i.e., a library
    > as part of the issued C++ ISO/ANSI standard. :-)

    Even with this restriction: on one side, C++ builds on top of the C
    standard, so it re-uses the C notion of stream (stdio), which does have a
    wchar_t variant. On the other side, we have the iostreams, well known in
    C++ folklore from Day 1, which I assume also have a wchar_t counterpart.
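
    To illustrate the C side of this, here is a minimal sketch of the
    wchar_t variants of the stdio calls. The function names
    format_code_point and print_wide_line are hypothetical, chosen only for
    this example; swprintf and fwprintf are the standard <wchar.h>
    counterparts of snprintf and fprintf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: formats a code point as "U+XXXX" into a
   wide-character buffer. Returns the number of wide characters written
   (not counting the terminator), or a negative value on failure. */
int format_code_point(wchar_t *buf, size_t n, unsigned long cp)
{
    assert(buf != NULL);
    return swprintf(buf, n, L"U+%04lX", cp);
}

/* Hypothetical helper: writes a wide string and a newline to a stream.
   A stream's orientation (byte or wide) is fixed by the first I/O
   operation on it, so f should not have been used for byte I/O before. */
int print_wide_line(FILE *f, const wchar_t *s)
{
    assert(f != NULL && s != NULL);
    return fwprintf(f, L"%ls\n", s);
}
```

    A caller would typically fill a small wchar_t array with
    format_code_point and then hand it to print_wide_line on a stream that
    is used for wide output only.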

    >> And I
    >> happen to know very well that the use of wchar_t streams (using
    >> the C meaning here, that is fwprintf etc.) is NOT widespread,
    >> for a lot of reasons.
    > In the past it has been so. But GNU GCC has now settled on using
    > a 32-bit type for wchar_t. So that is probably where matters are
    > heading.

    You got me wrong. Perhaps it is the direction a particular implementation is
    heading. I am just saying USERS (programmers) are not there.

    Everyone is free to develop a product nobody will use. In the commercial
    world, things usually stop quickly when some manager looks at the bottom
    line. The Open Source movement has the nice characteristic that this is
    not a showstopper. For the net result, look at

    >>> Portability does not mean that the program is expected to run
    >>> on different platforms without alterations, but merely tries
    >>> to lessen those needed changes.
    >> You are certainly free to define portability the way you want.
    > This is how one defines portability in the context of C/C++.

    If by "one" you mean yourself, we are in agreement.
    Now, if you mean the general meaning, definitely not.

    >> Just a philosophical point: automated parsers are to be used on
    >> formatted ("spoken") data to transform it into some binary
    >> representation suitable for later processing by computers.
    >> Requiring distributed data to be binary (probably based on
    >> efficiency criteria) is about taking just the opposite path.
    > I do not understand what you mean here:

    I just mean that if you are interested in parsers, as you say, you should
    not worry about binary data interchange.

    > All data in computers are
    > binary. Protocols for use with distributed data, like HTML, do
    > ensure that the binary data look the same across platforms.

    Yes. More below about your example.

    > But when using a C/C++ compiler this is not so:

    Yes it is. The first step of the formal model of a C/C++ compiler
    (translation phase 1 in both ISO standards) is to map the physical source
    characters into an internal representation. So it is the job of the
    compiler vendor to actually ensure the equivalence when it comes to C/C++
    sources.

    This is exactly the same as relying on the HTTP server and client to pass
    the HTML stream from the producer (the guy who wrote the page) to the user
    (the browser). Neither has any way to tell whether, on the wire, the two
    will agree to use UTF-8 rather than UTF-16 or whatever. Nor do they
    actually care. If _you_ care, you have to go down to the HTTP level (or
    below) to get that information. So much for the HTML protocol.

    If you are talking about C/C++ programs instead of compilers, it is more of
    the same.

    And the C/C++ paradigm is to use textual data when communicating (which is
    the framework targeted by Unicode). If you want more precise behaviour at
    the binary level, you should probably consider at least POSIX instead, or
    perhaps some ABI built on top of it. Also restrict your low-level I/O to
    unsigned char: C (and so C++) has definite provisions to ensure what you
    want (or what you claim to want) when you use them.
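
    As an illustration of byte-level control through unsigned char, here
    is a minimal sketch that encodes one code point as UTF-8 and writes the
    bytes with fwrite. The names utf8_encode and write_code_point are
    hypothetical, and the sketch assumes the present Unicode range (up to
    U+10FFFF, surrogates excluded) rather than the older 31-bit form of
    UTF-8. The stream is assumed to have been opened in binary mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: encodes one Unicode scalar value as UTF-8.
   out must have room for 4 bytes. Returns the number of bytes written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0UL | (cp >> 6));
        out[1] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;                       /* surrogates are not encodable */
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0UL | (cp >> 12));
        out[1] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        out[0] = (unsigned char)(0xF0UL | (cp >> 18));
        out[1] = (unsigned char)(0x80UL | ((cp >> 12) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
        out[3] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}

/* Hypothetical helper: writes the UTF-8 bytes of cp to a stream that the
   caller opened in binary mode (e.g. "wb"). Returns 0 on success. */
int write_code_point(FILE *f, unsigned long cp)
{
    unsigned char bytes[4];
    size_t n;

    assert(f != NULL);
    n = utf8_encode(cp, bytes);
    if (n == 0)
        return -1;
    return fwrite(bytes, 1, n, f) == n ? 0 : -1;
}
```

    Because every byte is computed explicitly, the output does not depend
    on the width of wchar_t or on the locale of the machine that runs it.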

    > Unicode is a protocol for distributed data. One expects the code
    > points to mean the same things everywhere. But in the "Unicode"
    > \u... construct of C++, one does not know anything like that; it
    > may not produce anything sensible at all.

    It certainly may ('may not' would mean that any use would be idiotic, so I
    assume this is not what you wanted to write.)

    I fail to follow your reasoning. It is true that "\u..." is not fully
    portable, but it is better than nothing. And if it is not rejected at
    compilation time, you can be fairly confident about the portability of the
    result; that is, fputs("\u0040", f) will produce the same thing everywhere,
    and it will be an AT SIGN (under local conventions, which may be subject to
    compiler switches), nothing else. Which is really a GOOD thing.
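
    A minimal sketch of this point follows. It uses fputs, the stdio call
    that takes a string and a stream. The function names are hypothetical,
    and the sketch assumes a compiler that accepts universal character
    names in string literals.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reports whether the compiler mapped \u0040 to the
   same execution-character-set encoding as a literal '@'. */
int ucn_is_at_sign(void)
{
    return strcmp("\u0040", "@") == 0;
}

/* Hypothetical helper: writes the AT SIGN, spelled as a universal
   character name, to the given stream. Returns 0 on success. */
int write_at_sign(FILE *f)
{
    assert(f != NULL);
    return fputs("\u0040", f) == EOF ? -1 : 0;
}
```

    The bytes that reach the stream are whatever the execution character
    set uses for AT SIGN, so they differ between, say, an ASCII-based and
    an EBCDIC-based platform, but on each one they are the local '@'.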

    Now if you want to quarrel that the result will be different on IBM iron
    than on your box, I fail to see the point.

    >>> People writing WWW-browsers and the like say it is a pain.
    >> I fail to see the point (why a browser should use \u?). Can you
    >> give an example of what you mean?
    > There was a guy, a few years ago, giving an example.
    My guess is that proper compiler support for \u was missing then. It is a
    recent feature: the first official publication is 1998 for C++ and 1999 for
    C (of course, the feature has been known for many more years in the Java
    realm). Also, one can note that GCC 3.1 (2002) documents the support for
    this as "Done", while it was "Broken" in the previous release. This should
    be seen in light of the fact that GCC is an actively developed compiler,
    with a large base of contributors. It is easy to guess that less well
    supported compilers could be behind.

    However, that is not a good argument against a feature. In 2005, there are
    still people writing K&R-style functions, for lack of support for
    prototypes in all the compilers they are using. But you would not argue
    against prototypes on that basis, saying for example that "prototypes are
    a pain" ;-).

