Re: 32'nd bit & UTF-8

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Jan 20 2005 - 03:51:51 CST


    On Wednesday, January 19th, 2005 19:56Z, Hans Aberg wrote:

    > On 2005/01/19 15:33, Antoine Leca wrote:
    >
    >>> Under C/C++ one can actually use, apart from byte streams, other
    >>> streams such as wchar_t.
    >>
    >> This could miss C/C++ objectives of portability. Please re-read
    >> TUS 5.2 about this.
    >
    > Sorry, I do not know what TUS 5.2 is.

    I am sorry for this use of an acronym; note that it is used VERY
    frequently here. It stands for The Unicode Standard, one of the topics of
    this list, you know. Actually, the reference is to
    http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf, subclause 2, titled
    "ANSI/ISO C wchar_t".

    > You need to explain your thought here.

    They are not MY thoughts. In fact I disagree with part of this. But they
    are part of the standard you are discussing (or should be), and as such
    they are the byproduct of a (longstanding) consensus, which is why I
    advised you to have a look at it.

    You really should read a bit more than Markus's (otherwise good)
    introduction to Unicode.

    > C++ already has a standard library for wchar_t streams.

    Probably. I even guess there is more than one, in fact (with different
    levels of "standardization"), which in turn is a problem.
    I was aiming more at C, since it is the subject I can control. And I
    happen to know very well that the use of wchar_t streams (in the C sense,
    that is fwprintf etc.) is NOT widespread, for a lot of reasons.
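
    A minimal sketch of what I mean by wchar_t streams in the C sense, using
    only fwprintf and friends from <wchar.h>; what actually appears on the
    terminal depends on the locale and on the platform's wchar_t encoding,
    which is precisely the portability question:

        #include <locale.h>
        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            /* Pick up the user's locale so the wide stream can convert. */
            setlocale(LC_ALL, "");

            /* The first wide operation gives stdout wide orientation. */
            fwprintf(stdout, L"wchar_t is %u bytes on this platform\n",
                     (unsigned)sizeof(wchar_t));
            fwprintf(stdout, L"U+00E9 prints as: %lc\n", (wint_t)0x00E9);
            return 0;
        }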

    > Portability does not mean that the program is expected to run
    > on different platforms without alterations, but merely tries
    > to lessen those needed changes.

    You are certainly free to define portability the way you want. I should
    just make clear that my view differs from both sentences above.

    >> Then you could begin to understand the point: the size of the
    >> underlying type is irrelevant.
    >
    > It is not irrelevant, because the inability of knowing the
    > underlying binary structure causes problems, especially in the case
    > of distributed data.

    Just a philosophical point: automated parsers are meant to be used on
    formatted ("spoken") data to transform it into some binary representation
    suitable for later processing by computers. Requiring distributed data to
    be binary (probably on efficiency criteria) amounts to taking just the
    opposite path.

    This is not to lessen the need for binary exchangeable data. In fact,
    ISO/IEC 10646 initially established an unambiguous scheme for data (that
    is, network order). Practice showed it was not adequate (Intel's sales
    numbers might be one reason; waste of storage space, another). My guess is
    that the opposite move, forcing little-endian everywhere on the basis that
    most cores are set up that way today, won't be right either, partly
    because it is uncomfortable for us humans.
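
    To make the byte-order issue concrete, here is a minimal sketch (my own
    illustration, nothing mandated by 10646) of emitting UTF-16 code units in
    network order, whatever the endianness of the host happens to be:

        #include <stdio.h>

        /* Write one UTF-16 code unit big-endian (network order), so the
           byte stream is the same on little-endian and big-endian hosts. */
        static void put_utf16be(unsigned cu, FILE *out)
        {
            putc((cu >> 8) & 0xFF, out);  /* high byte first */
            putc(cu & 0xFF, out);         /* low byte second */
        }

        int main(void)
        {
            put_utf16be(0x0041, stdout);  /* 'A'    -> 00 41 */
            put_utf16be(0x00E9, stdout);  /* U+00E9 -> 00 E9 */
            return 0;
        }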

    > C/C++ have in the past been written in this way in order to admit a
    > host of local character encodings.

    Because of real-world requirements; it certainly was not a design
    constraint ;-). It is just that the Unix/C model showed its adaptability
    to this (compare with some of its competitors that were not as adaptable:
    they disappeared).

    > But Unicode tries to bypass this issue by creating a single
    > universal format.

    Agreed. "It tries." Quite right to write it this way ;-).

    > Then it turns out that what was intended as flexibility of C/C++,
    > in fact are a straitjacket.

    Look: Microsoft uses just about only C and C++ to write its operating
    system (and no, they do not use Basic ;-)). As far as I know, this is, by
    a fair margin, the largest Unicode project today.
    I agree they do not use GNU GCC, or more exactly they use(d) it only very
    marginally.

    > For example, the \u... construction of C++ is unusable for writing
    > Unicode code, as one does not know what it will mean on the local
    > compiler.

    Please complain to your compiler (and standard library) vendor.

    I actually had very long discussions with Tom Plum about this very point
    (overspecification of \u in the C++ standard, in my eyes). And I was
    defending the position of the portable freestanding compilers (among them
    GCC, under my definition of portability), so I have a fairly good idea of
    what it is about.
    But that was in 1998. Things evolve.
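
    For the record, a minimal sketch of the construct under discussion (a \u
    universal character name in a C wide string literal); how the compiler
    maps it onto the execution wide-character set is exactly the
    implementation-defined part people complain about:

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            /* \u00E9 names U+00E9 (e with acute); its wchar_t value in the
               compiled program is implementation-defined. */
            const wchar_t *s = L"caf\u00E9";
            wprintf(L"%ls\n", s);
            return 0;
        }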

    > People writing WWW-browsers and the like say it is a pain.

    I fail to see the point (why should a browser use \u?). Can you give an
    example of what you mean?

    Antoine


