Re: 32'nd bit & UTF-8

From: Antoine Leca (
Date: Fri Jan 21 2005 - 05:30:53 CST

    [One should NEVER give food to a troll. They are always hungry.]

    On Thursday, January 20, 2005 20:46Z Hans Aberg wrote:

    > On 2005/01/20 15:42, Antoine Leca wrote:
    >>> There cannot be more than one _standard_ library, i.e., a library
    >>> as part of the issued C++ ISO/ANSI standard. :-)
    >> Even with this restriction: C++ on one side builds on top of the C
    >> standard,
    > Actually, C++ is a wholly independent language, but with a C like
    > syntax, and the requirement that C++ code can be linked with C
    > code.

    Actually, this is incorrect. The C++ standard makes an explicit /reference/ to
    the C standard (as well as to Amd.1:1995 to C90). And in the very body of the
    C++ IS, we can read:


      -- Next, all open C streams (as mediated by the functions signatures
      declared in <cstdio>) [...]

    > It is a complex issue to figure out how these two hang together.

    I am not sure how I should take this sentence: it cannot be sententious,
    because then it would be an insult... so I cannot decide. Please elaborate
    (even more food, sorry).

    <About C++ wide-oriented iostreams>
    > That already seems to have happened with GNU GCC, which fixes
    > wchar_t to 32-bits.

    Of course it happened! It is buried inside the C++ standard; the very least
    that GCC can do in 2002 is to implement the standard!
    Yet others have had it, and for a number of years. The Standard recognizes the
    work of Bill Plauger here, and what could we find in the VC6 standard library?
    Well, Bill's implementation...

    <Use of wchar_t>
    >> You got me wrong. Perhaps it is the direction a particular
    >> implementation is heading. I am just saying USERS (programmers)
    >> are not there.
    > Those things are not widespread. But in the past, GNU has often
    > proved to be leading on new features. So it may then come.

    "Often" is too strong a word here. I prefer the "may" ;-).
    And past references in this area will uncover a lot of grievances from part of
    the GNU world toward the wchar_t kludge (particularly since the first
    experiences included Sun/Solaris, where wchar_t values are a developed form of
    DBCS, which is _not_ an incarnation of UCS.)
    Also, the Linux world is heading in another direction, namely using
    UTF-8 encoded strings (following Rob Pike's analysis for Plan 9) instead of
    relying on wchar_t.

    I see the users standing doubtful in the middle of the field.

    Happily, there is light; it is called IUC! (Only half-joking.)

    >>>>> Portability does not mean that the program is expected to run
    >>>>> on different platforms without alterations, but merely tries
    >>>>> to lessen those needed changes.
    >>>> You are certainly free to define portability the way you want.
    >>> This is how one define portability in the context of C/C++.
    >> If by "one" you mean yourself, we are in agreement.
    >> Now, if you mean the general meaning, definitively no.
    > It is quote from BS (principal designer of C++) somewhere, I think,
    > but I do not remember where. Perhaps it is in his "DEC++". Check it
    > out in the C/C++ standards newsgroups.

    Sorry, please do your homework yourself. A search for
    Stroustrup+portability+lessen does not show anything on G* groups.
    Before I answered you, I checked the official statements, that is, the texts
    of both standards and the documents the working groups published.
    Dr. Stroustrup is certainly an important reference in the C++ world, yet he
    is not the only person whose advice is taken into account; I do not see any
    operational influence of his in the C context (I see only one message from
    him in comp.std.c, and it was a crosspost); and he has a record of
    contradictory statements.

    Then, if you are referring to the portability from C to C++, it is
    completely out of scope.

    >> And the C/C++ paradigm is to use textual data when communicating
    >> (which is the framework targetted by Unicode).
    > But only within the framework of each single compiler.

    No. Please read what I wrote instead of commenting on your own idea.

    > In fact, sometimes even the different compilers on the same
    > platform use different binary models, at least in the past.


    > Then special efforts are required when object code from different
    > compilers should be linked together.

    The C and Unix standardization objective: source code is important, binary is
    not. This is the very reason why the OpenSource movement developed, BTW.

    > It is a pain, when that happens, because the
    > program just does not run properly, and one does not know why.

    Perhaps an error in the Unix (source-based) paradigm in the first place?

    >> And also restrict your low-level I/O to
    >> unsigned char, C (so C++) has definitive provisions to ensure what
    >> you want (or what you pretend to want) using them.
    > There is no guarantee that these will be 8-bit bytes.

    On Posix (-2001 and up), yes they will be.

    And even on platforms where bytes are not 8-bit, it just works. Portability
    problems toward such architectures do exist, but usually they come from bad
    programming practices in the first place; once this is corrected toward
    better orthodoxy (like using unsigned char instead of relying on signed char
    to be sign-extending from -128 to 127), without any constraint for the
    corrected source to run on a classical platform, it generally works well on
    the strange one as well. Practice has shown _many_ more problems with
    pointers not being of either int or long width, or EBCDIC, or the confusion
    in Unix between binary and text files.
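
    [A minimal sketch of the unsigned char practice described above. The
    helper names byte_value and count_code_points are hypothetical, chosen
    for illustration only; the code assumes the buffer holds UTF-8 octets,
    one per unsigned char, whatever CHAR_BIT happens to be.]

```c
#include <stddef.h>

/* Hypothetical helper: read a plain char as a non-negative byte value.
   Converting through unsigned char avoids depending on whether plain
   char is signed and sign-extends on this platform. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* Hypothetical helper: count the code points in a UTF-8 buffer by
   counting every byte that is not a continuation byte (10xxxxxx).
   The buffer is typed as unsigned char, so the mask test never sees
   a negative value. */
size_t count_code_points(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0u) != 0x80u)
            n++;
    }
    return n;
}
```

    [With plain (possibly signed) char, a test such as c >= 0x80 silently
    fails on platforms where char is signed; going through unsigned char
    gives the same answer on both kinds of platform.]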

    <About \u...>
    > The problem is not having such features, but that they are not
    > sufficiently specific when putting requirements on the underlying
    > binary model. This then causes problems when working with Unicode,
    > unless the compiler writer has decided to fill in Unicode friendly
    > features in the lack of the standard defining them.

    Do you know of __STDC_ISO_10646__ ?
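
    [For readers who do not: C99 lets an implementation define
    __STDC_ISO_10646__ as an integer constant of the form yyyymmL to state
    that wchar_t values are ISO/IEC 10646 code points as of that revision.
    A minimal sketch of testing for it; the function name iso10646_version
    is hypothetical.]

```c
#include <wchar.h>

/* Hypothetical helper: return the yyyymm value promised by the
   implementation, or 0 when it makes no promise that wchar_t holds
   ISO/IEC 10646 code points. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

    [A return value of 0 only means the implementation makes no such
    promise; it says nothing about what wchar_t actually contains.]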

    A point you are missing is that wchar_t was introduced at about the same
    time as Unicode (around 1990), and the initial idea was NOT to match them
    exactly (re-read Dr Kuhn's paper). It is true that seen from 2005, it makes
    a lot of sense to merge these two notions, and it looks like some people in
    the GNU world (like Dr Kuhn) advocate it. Yet it is difficult to convince
    compiler vendors to move away from their supported base to another,
    different, model: there is a migration cost (as Dr Kuhn points out, BTW).
    This is the basic reason why a new paradigm has been proposed, which relies
    on new typenames (char16_t and char32_t, in TR19769). ICU is another,
    similar, attempt.
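
    [A minimal sketch of what the new typenames look like in use. It
    assumes a compiler that provides them through the C11 header <uchar.h>,
    which is where the TR19769 names were later standardized; the function
    name utf16_encode is hypothetical.]

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of char16_t units written (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(char32_t cp, char16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a surrogate */
    if (cp < 0x10000) {
        out[0] = (char16_t)cp;          /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits left to split */
    out[0] = (char16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

    [The point of the separate typenames is that their width does not
    depend on what a given vendor chose for wchar_t.]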

    Yet at least Dr Kuhn, in his (now aging) paper, makes it very clear that it
    is _one_ way to implement Unicode in the Unix/Linux world.
    You are coming here with much more arrogance, saying that this proposition
    should in fact be the Only Law and should be implemented by everyone,
    everywhere, and basing your reasoning on one view of how to implement a
    Unicode version of lex.
    Sorry, but it will not work.

    In another post:
    >>> The main problem is that in some domains, UTF-16 is already at
    >>> use.
    >>> So there, one would need time to change.
    >> This assumes that UTF-16 is 'wrong', isn't it? And furthermore,
    >> that UNIX (whatever you are hiding behind this word) is 'right'.
    > That seems to be the case, as a tendency of what one will actually
    > use.

    Look, use of UTF-16 has probably been multiplied by a factor of 10 or
    more between 1999 and now. And it overwhelmingly surpasses uses of UTF-32. So
    much for tendencies.

    > Yes. But since memory in computers doubles every 18 months or
    > faster, this should not be of much problem.

    Dr. Moore's law is about the number of transistors, and is mistakenly
    extrapolated to computing power. You are creatively applying it to memory
    size now. Following your point, we should prepare UTF-512 for year... 2011.