Re: 32'nd bit & UTF-8

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Jan 20 2005 - 03:51:51 CST


    On Wednesday, January 19th, 2005 19:56Z, Hans Aberg wrote:

    > On 2005/01/19 15:33, Antoine Leca wrote:
    >
    >>> Under C/C++ one can actually use, apart from byte streams, other
    >>> streams such as wchar_t.
    >>
    >> This could miss C/C++ objectives of portability. Please re-read
    >> TUS 5.2 about this.
    >
    > Sorry, I do not know what TUS 5.2 is.

    I am sorry for this use of an acronym; note that it is used VERY
    frequently here. It stands for The Unicode Standard, one of the topics of
    this list, you know. Actually, the reference is to
    http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf, subclause 2, titled
    "ANSI/ISO C wchar_t".

    > You need to explain your thought here.

    They are not MY thoughts. In fact I disagree with part of this. But they
    are part of the standard you are discussing (or should be), and as such
    they are the byproduct of a (longstanding) consensus, which is why I
    advised you to have a look at it.

    You really should read a bit more than Markus's (otherwise good)
    introduction to Unicode.

    > C++ already has a standard library for wchar_t streams.

    Probably. I even guess there is more than one, in fact (with different
    levels of "standardization"), which in turn is a problem.
    I was aiming more at C, since it is the subject I can control. And I
    happen to know very well that the use of wchar_t streams (in the C sense,
    that is fwprintf etc.) is NOT widespread, for a lot of reasons.
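
    A minimal sketch of what I mean by wchar_t streams in the C sense, using
    only fwprintf and friends from <wchar.h>; what actually appears on the
    terminal depends on the locale and on the platform's wchar_t encoding,
    which is precisely the portability question:

        #include <locale.h>
        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            /* Pick up the user's locale so the wide stream can convert. */
            setlocale(LC_ALL, "");

            /* The first wide operation gives stdout wide orientation. */
            fwprintf(stdout, L"wchar_t is %u bytes on this platform\n",
                     (unsigned)sizeof(wchar_t));
            fwprintf(stdout, L"U+00E9 prints as: %lc\n", (wint_t)0x00E9);
            return 0;
        }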

    > Portability does not mean that the program is expected to run
    > on different platforms without alterations, but merely tries
    > to lessen those needed changes.

    You are certainly free to define portability the way you want. I should
    just make clear that my view differs from both sentences above.

    >> Then you could begin to understand the point: the size of the
    >> underlying type is irrelevant.
    >
    > It is not irrelevant, because the inability of knowing the
    > underlying binary structure causes problems, especially in the case
    > of distributed data.

    Just a philosophical point: automated parsers are meant to be used on
    formatted ("spoken") data to transform it into some binary representation
    suitable for later processing by computers. Requiring distributed data to
    be binary (probably on efficiency criteria) amounts to taking just the
    opposite path.

    This is not to lessen the need for binary exchangeable data. In fact,
    ISO/IEC 10646 initially established an unambiguous scheme for data (that
    is, network order). Practice showed it was not adequate (Intel's sales
    numbers might be one reason; waste of storage space, another). My guess is
    that the opposite move, forcing little-endian everywhere on the basis that
    most cores are set up that way today, won't be right either, partly
    because it is uncomfortable for us humans.
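
    To make the byte-order issue concrete, here is a minimal sketch (my own
    illustration, nothing mandated by 10646) of emitting UTF-16 code units in
    network order, whatever the endianness of the host happens to be:

        #include <stdio.h>

        /* Write one UTF-16 code unit big-endian (network order), so the
           byte stream is the same on little-endian and big-endian hosts. */
        static void put_utf16be(unsigned cu, FILE *out)
        {
            putc((cu >> 8) & 0xFF, out);  /* high byte first */
            putc(cu & 0xFF, out);         /* low byte second */
        }

        int main(void)
        {
            put_utf16be(0x0041, stdout);  /* 'A'    -> 00 41 */
            put_utf16be(0x00E9, stdout);  /* U+00E9 -> 00 E9 */
            return 0;
        }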

    > C/C++ have in the past been written in this way in order to admit a
    > host of local character encodings.

    Because of real-world requirements; it certainly was not a design
    constraint ;-). It is just that the Unix/C model showed its adaptability
    to this (compare with some of its competitors that were not as adaptable:
    they disappeared).

    > But Unicode tries to bypass this issue by creating a single
    > universal format.

    Agreed. "It tries." Quite right to write it this way ;-).

    > Then it turns out that what was intended as flexibility of C/C++,
    > in fact are a straitjacket.

    Look: Microsoft uses just about only C and C++ to write its operating
    system (and no, they do not use Basic ;-)). As far as I know, this is, by
    a fair margin, the largest Unicode project today.
    I agree they do not use GNU GCC, or more exactly they use(d) it only very
    marginally.

    > For example, the \u... construction of C++ is unusable for writing
    > Unicode code, as one does not know what it will mean on the local
    > compiler.

    Please complain to your compiler (and standard library) vendor.

    I actually had very long discussions with Tom Plum about this very point
    (overspecification of \u in the C++ standard, in my eyes). And I was
    defending the position of the portable freestanding compilers (among them
    GCC, under my definition of portability), so I have a fairly good idea of
    what it is about.
    But that was in 1998. Things evolve.
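
    For the record, a minimal sketch of the construct under discussion (a \u
    universal character name in a C wide string literal); how the compiler
    maps it onto the execution wide-character set is exactly the
    implementation-defined part people complain about:

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            /* \u00E9 names U+00E9 (e with acute); its wchar_t value in the
               compiled program is implementation-defined. */
            const wchar_t *s = L"caf\u00E9";
            wprintf(L"%ls\n", s);
            return 0;
        }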

    > People writing WWW-browsers and the like say it is a pain.

    I fail to see the point (why should a browser use \u?). Can you give an
    example of what you mean?

    Antoine


