Re: 32'nd bit & UTF-8

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Jan 20 2005 - 08:42:38 CST

Next message: Rick McGowan: "Re: 32'nd bit & UTF-8"

Previous message: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Hans Aberg: "Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: 32'nd bit & UTF-8"

On Thursday, January 20th, 2005 12:51Z Hans Aberg wrote:
>>> C++ already has a standard library for wchar_t streams.
>>
>> Probably. I even guess there are more than one,
>
> There cannot be more than one _standard_ library, i.e., a library
> that is part of the issued C++ ISO/ANSI standard. :-)

Even with this restriction: on one side, C++ builds on top of the C standard,
and so reuses the C notion of a stream (stdio), which does have a wchar_t
variant. On the other side we have the iostreams, well known in C++ folklore
from Day 1, which I assume also have a wchar_t counterpart.
:-)
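
As an illustration of the two families mentioned here, a minimal C++ sketch that formats the same text once through the C wide-character stdio functions and once through the wide iostreams. The function names are hypothetical, and it formats into memory (swprintf, std::wostringstream) rather than writing to a file, to keep it self-contained.

```cpp
#include <cstddef>
#include <cwchar>
#include <sstream>
#include <string>

// Hypothetical helper: format through the C wide stdio family.
std::wstring format_with_c_stdio(int value) {
    wchar_t buffer[32];
    // Unlike sprintf, swprintf takes the buffer size, counted in wide characters.
    int written = std::swprintf(buffer, sizeof buffer / sizeof buffer[0],
                                L"value=%d", value);
    if (written < 0) {
        return std::wstring();  // formatting failed or did not fit
    }
    return std::wstring(buffer, static_cast<std::size_t>(written));
}

// Hypothetical helper: format the same text through the C++ wide iostreams.
std::wstring format_with_iostreams(int value) {
    std::wostringstream out;
    out << L"value=" << value;
    return out.str();
}
```

Both helpers return the same wide string for the same input; which family a program uses is a matter of house style, not of capability.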

>> And I
>> happen to know very well that the use of wchar_t streams (using
>> the C meaning here, that is fwprintf etc.) is NOT widespread,
>> for a lot of reasons.
>
> In the past it has been so. But GNU GCC has now settled on making
> wchar_t a 32-bit type. So that is probably where matters are
> heading.

You got me wrong. Perhaps it is the direction a particular implementation is
heading in. I am just saying that USERS (programmers) are not there.

Everyone is free to develop a product nobody will use. In the commercial
world, things usually stop quickly when some manager looks at the bottom
line. The open source movement has the nice characteristic that this is not
an obstacle. For the net result, look at http://sourceforge.net/search/.

>>> Portability does not mean that the program is expected to run
>>> on different platforms without alterations, but merely tries
>>> to lessen those needed changes.
>>
>> You are certainly free to define portability the way you want.
>
> This is how one defines portability in the context of C/C++.

If by "one" you mean yourself, we are in agreement.
Now, if you mean the general meaning, definitely not.

>> Just a philosophical point: automated parsers are to be used on
>> formatted ("spoken") data to transform it into some binary
>> representation suitable for later processing by computers.
>> Requiring distributed data to be binary (probably based on
>> efficiency criteria) is taking just the opposite path.
>
> I do not understand what you mean here:

I just mean that if you are interested in parsers, as you say, you should not
worry about binary data interchange.

> All data in computers are
> binary. Protocols for use with distributed data, like HTML, do
> ensure that the binary data look the same across platforms.

Yes. More below about your example.

> But when using a C/C++ compiler this is not so:

Yes it is. The first step of the formal model of a C/C++ compiler (according
to both ISO standards) is to map the physical source characters into an
internal representation. So it is the job of the compiler vendor to actually
ensure this equivalence when it comes to C/C++ sources.

This is exactly the same as relying on the HTTP server and client to pass
the HTML stream from the producer (the guy who wrote the page) to the user
(the browser). Neither has any way to say whether, on the wire, the two ends
will agree to use UTF-8 rather than UTF-16 or whatever. Nor do they actually
care. If _you_ care, you have to go to the HTTP level (or below) to get that
information. So much for the HTML protocol.
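
To make "going to the HTTP level" concrete, here is a simplified C++ sketch that pulls the charset parameter out of a Content-Type header value. The function name charset_of is hypothetical, and the parsing is deliberately naive: it assumes the parameter is spelled exactly "charset=" with no surrounding spaces, and it is not a complete header parser.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: return the charset parameter of a Content-Type header
// value, lower-cased, or an empty string when no such parameter is present.
std::string charset_of(const std::string& content_type) {
    std::string lower(content_type);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos) {
        return std::string();
    }
    pos += key.size();

    // The value may be quoted; skip one opening quote if present.
    if (pos < lower.size() && lower[pos] == '"') {
        ++pos;
    }
    // The value ends at a closing quote, a separator, or the end of the string.
    std::string::size_type end = lower.find_first_of("\"; \t", pos);
    if (end == std::string::npos) {
        return lower.substr(pos);
    }
    return lower.substr(pos, end - pos);
}
```

For a header value such as "text/html; charset=UTF-8" this returns "utf-8". The HTML author never sees this value; only code sitting at the HTTP layer does.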

If you are talking about C/C++ programs instead of compilers, it is more of
the same.

And the C/C++ paradigm is to use textual data when communicating (which is
the framework targeted by Unicode). If you want more precise behaviour at
the binary level, you should probably consider at least POSIX instead, or
perhaps some ABI built on top of it. And also restrict your low-level I/O to
unsigned char: C (and so C++) has definite provisions to ensure what you want
(or what you claim to want) using them.
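
A minimal C++ sketch of what restricting low-level I/O to unsigned char can look like: a 32-bit code point is split into four octets by arithmetic, so the result does not depend on the host's byte order or on the size of wchar_t. The function names and the choice of most-significant-octet-first order are assumptions made for the illustration.

```cpp
#include <cstdint>

// Hypothetical helper: store a 32-bit code point as four octets,
// most significant octet first (an arbitrary choice for this sketch).
void put_code_point_be(std::uint32_t code_point, unsigned char out[4]) {
    out[0] = static_cast<unsigned char>((code_point >> 24) & 0xFFu);
    out[1] = static_cast<unsigned char>((code_point >> 16) & 0xFFu);
    out[2] = static_cast<unsigned char>((code_point >> 8) & 0xFFu);
    out[3] = static_cast<unsigned char>(code_point & 0xFFu);
}

// Hypothetical helper: rebuild the code point from the same four octets.
std::uint32_t get_code_point_be(const unsigned char in[4]) {
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8) |
           static_cast<std::uint32_t>(in[3]);
}
```

The octet buffer can then be passed to fwrite on a stream opened in binary mode; the shifts, not the memory layout of the integer, decide what ends up on disk.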

> Unicode is a protocol for distributed data. One expects the code
> points to mean the same things everywhere. But in the "Unicode"
> \u... construct of C++, one does not know anything like that; it
> may not produce anything sensible at all.

It certainly may ('may not' would mean that any use would be idiotic, so I
assume this is not what you wanted to write.)

I fail to follow your reasoning. It is true that "\u..." is not fully
portable, but it is better than nothing. And if it is not rejected at
compilation time, you can be fairly confident about the portability of the
result; that is, fputs("\u0040", f) will produce the same thing everywhere,
and it will be an AT SIGN (under local conventions, which may be subject to
compiler switches), nothing else. Which is really a GOOD thing.

Now if you want to quarrel that the result will be different on IBM iron
than on your box, I fail to see the point.

>>> People writing WWW-browsers and the like say it is a pain.
>>
>> I fail to see the point (why a browser should use \u?). Can you
>> give an example of what you mean?
>
> There was a guy, a few years ago, giving an example.
                   ^^^^^^^^^^^^^^^
My guess is that proper compiler support for \u was missing then. It is a
recent feature: the first official publication was 1998 for C++ and 1999 for
C (of course the feature has been known for many more years in the Java
realm). Also, one can note that GCC 3.1 (2002) documents that support for
this is "Done", while it was "Broken" in the previous release. This should be
weighed against the fact that GCC is an actively developed compiler, with a
large base of contributors. It is easy to guess that less well supported
compilers could be behind.

However, that is not a good argument against a feature. In 2005, there are
still people writing K&R-style functions, for lack of support for prototypes
in all the compilers they are using. But you would not argue against
prototypes, saying for example that "prototypes are a pain" ;-).

Antoine

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 10:35:11 CST