Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 06:51:11 CST


    At 10:51 +0100 2005/01/20, Antoine Leca wrote:
    >> Sorry, I do not know what TUS 5.2 is.
    >
    >I am sorry for this use of an acronym. Note it is VERY frequent here. It
    >stands for The Unicode Standard, one of the topics of this list, you know.
    >Actually, the reference is to
    >http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf, subclause 2, titled
    >"ANSI/ISO C wchar_t".
    >
    >> You need to explain your thought here.
    >
    >They are not MY thoughts. In fact I disagree with part of this. But they
    >are part of the standard you are discussing (or should be), and as such
    >they are the byproduct of a (longstanding) consensus. Which is why I
    >advised you to have a look at it.
    >
    >You really should read a bit more than Markus's (otherwise good)
    >introduction to Unicode.

    Thanks for the pointer. But my scope was really only to figure out how to
    implement Unicode in Flex. I will leave these other, highly technical
    issues to the experts.

    >> C++ already has a standard library for wchar_t streams.
    >
    >Probably. I even guess there are more than one,

    There cannot be more than one _standard_ library, i.e., a library that is
    part of the issued ISO/ANSI C++ standard. :-)

    > And I
    >happen to know very well that the use of wchar_t streams (using the C
    >meaning here, that is fwprintf etc.) is NOT widespread, for a lot of
    >reasons.

    In the past, that has been so. But GNU GCC has now settled on using
    wchar_t as a 32-bit type, so that is probably where matters are heading.
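
    To illustrate (just a sketch, assuming a hosted compiler; the width of
    wchar_t is implementation-defined, so the only portable way to know it is
    to check):

      #include <climits>
      #include <cstdio>

      int main()
      {
          /* On glibc-based GNU systems this typically prints 32; on
             Windows compilers, typically 16. */
          std::printf("wchar_t is %lu bits wide\n",
                      (unsigned long)(sizeof(wchar_t) * CHAR_BIT));
          return 0;
      }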

    >> Portability does not mean that the program is expected to run
    >> on different platforms without alterations, but merely tries
    >> to lessen those needed changes.
    >
    >You are certainly free to define portability the way you want.

    This is how one defines portability in the context of C/C++. One does not
    expect such programs to run on widely different architectures without
    adaptations. For example, one may settle on writing code for a wide range
    of 32-bit machines, in order to minimize those needed adaptations.
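
    One common way of pinning down such an assumption (a minimal sketch) is a
    compile-time check, so that porting to an odd platform fails loudly
    instead of silently:

      #include <climits>

      /* This code assumes a 32-bit unsigned int; making the assumption
         explicit stops the build on an unexpected platform. */
      #if UINT_MAX != 0xFFFFFFFF
      #error "32-bit unsigned int assumed; adapt before porting."
      #endif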

    >>> Then you could begin to understand the point: the size of the
    >>> underlying type is irrelevant.
    >>
    >> It is not irrelevant, because the inability of knowing the
    >> underlying binary structure causes problems, especially in the case
    >> of distributed data.
    >
    >Just a philosophical point: automated parsers are to be used on formatted
    >("spoken") data to transform them into some binary representation suitable
    >for later processing with computers. Requiring distributed data to be
    >binary (probably based on efficiency criteria) is about taking just the
    >opposite path.

    I do not understand what you mean here: All data in computers are binary.
    Protocols for use with distributed data, like HTML, ensure that the binary
    data look the same across platforms. But when using a C/C++ compiler this
    is not so: data types can be padded in most unexpected manners, and there
    is no good way to tell how this padding is done except by studying each
    compiler explicitly. So one cannot just write a C/C++ data structure to a
    file, transport it over the Internet via a protocol, and expect it to make
    sense for the same program compiled with another compiler.
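
    For example (a minimal sketch with a made-up struct): the layout below is
    not a portable exchange format, because the padding between the members
    and the byte order of the integer are both up to the implementation.

      #include <cstdio>

      struct Record {
          char tag;     /* 1 byte                                       */
          long value;   /* padding before this member and its byte order
                           are implementation-defined                   */
      };

      int main()
      {
          /* sizeof(Record) commonly differs between compilers and
             targets, so fwrite()-ing the struct directly does not give
             an interchangeable format; each field has to be serialized
             explicitly, byte by byte. */
          std::printf("sizeof(Record) = %lu\n",
                      (unsigned long)sizeof(Record));
          return 0;
      }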

    Unicode is a protocol for distributed data: one expects the code points to
    mean the same things everywhere. But with the "Unicode" \u... construct of
    C++, one does not know anything like that; it may not produce anything
    sensible at all.
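
    To spell that out (a sketch of standard C++ behaviour): a universal
    character name in a character literal is mapped into the execution
    character set, so the stored value is implementation-defined and is not
    guaranteed to be the Unicode code point.

      /* The value of c depends on the compiler's execution character
         set; it is NOT guaranteed to be 0xC0, and the character may not
         even be representable in a plain char. */
      const char c = '\u00C0';

      /* The wide form is no better defined: the value depends on the
         implementation's wide execution character set. */
      const wchar_t w = L'\u00C0';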

    >This is not to lessen the need for binary exchangeable data. In fact,
    >ISO/IEC 10646 initially established an unambiguous scheme for data (that
    >is, network order). Practice has shown it was not adequate (Intel's sales
    >numbers might be one reason; waste of storage space, another). My guess
    >is that the opposite move, forcing little endian everywhere on the basis
    >that most cores are set up this way today, won't be correct either,
    >partly because it is uncomfortable for us humans.

    In <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, it says that big endian
    is recommended in UTF-32. My guess is that until there is consensus at the
    CPU level, both endiannesses will persist.
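
    In practice, one avoids the question by serializing explicitly (a minimal
    sketch, assuming the output is to be UTF-32BE regardless of the host CPU):

      #include <cstdio>

      /* Write one code point as UTF-32BE, independent of the host byte
         order; no range checking or error handling here. */
      void put_utf32be(unsigned long cp, std::FILE* out)
      {
          std::putc((int)((cp >> 24) & 0xFF), out);
          std::putc((int)((cp >> 16) & 0xFF), out);
          std::putc((int)((cp >>  8) & 0xFF), out);
          std::putc((int)( cp        & 0xFF), out);
      }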

    >> People writing WWW-browsers and the like say it is a pain.
    >
    >I fail to see the point (why a browser should use \u?). Can you give an
    >example of what you mean?

    There was a guy, a few years ago, who gave an example. In order to make
    sure the software came out right with every compiler, he had to write,
    say:
      #define UCHAR ... /* get hold of a 32-bit integral type */

      const UCHAR LATIN_CAPITAL_LETTER_A_WITH_GRAVE = 0x000C0;
      ...
    Then use LATIN_CAPITAL_LETTER_A_WITH_GRAVE instead of \u00C0.

    I do not recall, though, what kind of browser or text processing software
    he was writing.
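
    For comparison, the same idea with the C99 <stdint.h> types (a sketch,
    assuming the compiler ships that header, which is not yet universal) takes
    the guesswork out of the typedef:

      #include <stdint.h>   /* C99; many C++ compilers provide it too */

      typedef uint_least32_t UCHAR;   /* at least 32 bits, guaranteed */

      const UCHAR LATIN_CAPITAL_LETTER_A_WITH_GRAVE = 0x00C0;

      /* The named constant is used in place of the \u00C0 escape, so the
         stored value is the Unicode code point itself, not whatever the
         compiler's execution character set maps \u00C0 to. */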

      Hans Aberg


