Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 13:56:33 CST


    On 2005/01/19 15:33, Antoine Leca at Antoine10646@leca-marti.org wrote:

    >> Under C/C++ one can actually use, apart from byte streams, other
    >> streams, such as wchar_t streams.
    >
    > This could miss C/C++ objectives of portability. Please re-read TUS 5.2
    > about this.

    Sorry, I do not know what TUS 5.2 is. You need to explain your thought here.
    C++ already has a standard library for wchar_t streams. Portability does not
    mean that the program is expected to run on different platforms without any
    alterations, but merely that the changes needed are lessened.
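
    For example, one can write wide-character output with the standard stream
    classes (a minimal sketch; how the wide characters are translated to the
    external byte encoding depends on the imbued locale and on the platform):

        #include <iostream>

        int main() {
            // std::wcout is the standard wchar_t counterpart of std::cout.
            std::wcout << L"wchar_t stream output" << std::endl;
            return 0;
        }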

    >> Under C/C++, one will use a wchar_t which is always of exactly
    >> 32-bit,
    >
    > Wrong.
    >
    >> regardless what internal word structure the CPU is using in
    >> its memory bus.
    >
    > Worse. An ABI that requires an opaque type to be of a determinate shape
    > whatever the underlying structure, is missing completely the point.
    > Fortunately Posix does not do that.

    These standards do not require it, but GNU GCC, for example, has already
    decided to make wchar_t a 32-bit integral type. See
    <http://www.cl.cam.ac.uk/~mgk25/unicode.html>. So this is the direction in
    which things are de facto moving.
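
    To see what one's own compiler does, one can simply check (a sketch; the
    result is compiler-specific, typically 4 under GNU GCC but 2 under, for
    instance, Microsoft's compiler):

        #include <iostream>

        int main() {
            // Prints the size of wchar_t in bytes; GCC uses 4 on most targets.
            std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << std::endl;
            return 0;
        }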

    >> Moreover, the latest edition of C, C99, has types that the
    >> compiler can support where the sizes of the integral types are
    >> indicated.
    >
    > Yes. But their existence is not mandatory (at least the fixed-width one that
    > I believe you are alluding to). Depending on them makes your program less
    > portable.

    Nobody has claimed these things to be mandatory. They cannot be, because of
    the need for legacy support. But this is the direction in which things are
    de facto moving.
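
    A sketch of what I have in mind, using the C99 <stdint.h> header (many
    compilers also accept it in C++, although it is not yet part of the C++
    standard): uint32_t itself is optional, but uint_least32_t must exist.

        #include <stdint.h>
        #include <iostream>

        int main() {
            // uint_least32_t must exist and hold at least 32 bits; uint32_t
            // (exactly 32 bits) is only present where such a type exists.
            uint_least32_t code_point = 0x10FFFFUL;  // largest Unicode code point
            std::cout << std::hex << code_point << std::endl;
            return 0;
        }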

    > You would have a better luck with the proposed char32_t (TR 19769), which is
    > intended for this use. But then you would discover that it perfectly can be
    > 36 or 64 bits in length.

    This is a well-known problem with C, namely that one can never know the
    underlying binary format.
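
    Indeed, TR 19769 only promises "at least 32 bits". A sketch of what such a
    definition amounts to (hypothetical, merely to make the point):

        #include <stdint.h>

        // TR 19769 specifies char32_t as a typedef for uint_least32_t, so on a
        // 36- or 64-bit machine it may be wider than 32 bits, and its in-memory
        // representation is not fixed by the standard.
        typedef uint_least32_t char32_t;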

    > Then you could begin to understand the point: the size of the underlying
    > type is irrelevant.

    It is not irrelevant, because the inability to know the underlying binary
    structure causes problems, especially in the case of distributed data. As
    long as you stay on your own platform it is of less importance, unless you
    try to write binary files or the like, in which case you need to know
    platform- and compiler-specific details.
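
    For instance, to put code points into a binary file one has to fix the byte
    layout oneself, rather than relying on the in-memory representation of
    wchar_t (a hypothetical helper, just to illustrate):

        #include <cstdio>

        // Write a code point as four bytes, little-endian, so that the file
        // format depends neither on sizeof(wchar_t) nor on the CPU byte order.
        void put_u32_le(unsigned long cp, std::FILE* f) {
            std::putc(static_cast<int>( cp        & 0xFF), f);
            std::putc(static_cast<int>((cp >>  8) & 0xFF), f);
            std::putc(static_cast<int>((cp >> 16) & 0xFF), f);
            std::putc(static_cast<int>((cp >> 24) & 0xFF), f);
        }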

    >The fact it happens to be 32 on your box in this year
    > 2005 is just one aspect of the problem. What is important is to support the
    > range from 0 to 0x10FFFF (when it comes to Unicode). 32 bits are good for
    > that, and they are widespread, so it was a choice for some ABI to select
    > this. But the domain of the type is to be restricted to 0 to 0x10FFFF,
    > nothing else. And there is no point trying to enlarge this domain, at least
    > until you are dealing with Unicode that is characters.

    C/C++ have in the past been specified in this way in order to admit a host
    of local character encodings. But Unicode tries to bypass this issue by
    creating a single universal format. Then it turns out that what was intended
    as flexibility in C/C++ is in fact a straitjacket. For example, the \u...
    construction of C++ is unusable for writing Unicode in source code, as one
    does not know what it will mean under the local compiler. People writing
    WWW browsers and the like say it is a pain.
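
    To illustrate (a sketch; the universal character name below always denotes
    U+00E9, but what is actually stored depends on the implementation's
    execution wide-character set and on the width of wchar_t):

        // U+00E9, LATIN SMALL LETTER E WITH ACUTE. The standard leaves the
        // numerical value that ends up in e_acute implementation-defined.
        wchar_t e_acute = L'\u00E9';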

      Hans Aberg


