Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 06:51:11 CST


    At 10:51 +0100 2005/01/20, Antoine Leca wrote:
    >> Sorry, I do not know what TUS 5.2 is.
    >
    >I am sorry for this use of an acronym. Note it is VERY frequent here. It
    >stands for The Unicode Standard, one of the topics of this list, you know.
    >Actually, the reference is to
    >http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf, subclause 2, titled
    >"ANSI/ISO C wchar_t".
    >
    >> You need to explain your thought here.
    >
    >They are not MY thoughts. In fact I disagree with part of this. But they
    >are part of the standard you are discussing (or should be), and as such
    >they are the byproduct of a (longstanding) consensus. Which is why I
    >advised you to have a look at it.
    >
    >You really should read a bit more than Markus's (otherwise good)
    >introduction to Unicode.

    Thanks for the pointer. But my scope was really only to figure out how to
    implement Unicode in Flex. I will leave these other, highly technical
    issues to the experts.

    >> C++ already has a standard library for wchar_t streams.
    >
    >Probably. I even guess there are more than one,

    There cannot be more than one _standard_ library, i.e., a library that is
    part of the issued ISO/ANSI C++ standard. :-)

    > And I
    >happen to know very well that the use of wchar_t streams (using the C
    >meaning here, that is fwprintf etc.) is NOT widespread, for a lot of
    >reasons.

    In the past, that has been so. But GNU GCC has now settled on using
    wchar_t as a 32-bit type, so that is probably where matters are heading.
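
    To illustrate (just a sketch, assuming a hosted compiler; the width of
    wchar_t is implementation-defined, so the only portable way to know it is
    to check):

      #include <climits>
      #include <cstdio>

      int main()
      {
          /* On glibc-based GNU systems this typically prints 32; on
             Windows compilers, typically 16. */
          std::printf("wchar_t is %lu bits wide\n",
                      (unsigned long)(sizeof(wchar_t) * CHAR_BIT));
          return 0;
      }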

    >> Portability does not mean that the program is expected to run
    >> on different platforms without alterations, but merely tries
    >> to lessen those needed changes.
    >
    >You are certainly free to define portability the way you want.

    This is how one defines portability in the context of C/C++. One does not
    expect such programs to run on widely different architectures without
    adaptations. For example, one may settle on writing code for a wide range
    of 32-bit machines, in order to minimize those needed adaptations.
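
    One common way of pinning down such an assumption (a minimal sketch) is a
    compile-time check, so that porting to an odd platform fails loudly
    instead of silently:

      #include <climits>

      /* This code assumes a 32-bit unsigned int; making the assumption
         explicit stops the build on an unexpected platform. */
      #if UINT_MAX != 0xFFFFFFFF
      #error "32-bit unsigned int assumed; adapt before porting."
      #endif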

    >>> Then you could begin to understand the point: the size of the
    >>> underlying type is irrelevant.
    >>
    >> It is not irrelevant, because the inability of knowing the
    >> underlying binary structure causes problems, especially in the case
    >> of distributed data.
    >
    >Just a philosophical point: automated parsers are to be used on formatted
    >("spoken") data to transform them into some binary representation suitable
    >for later processing with computers. Requiring distributed data to be
    >binary (probably based on efficiency criteria) is about taking just the
    >opposite path.

    I do not understand what you mean here: All data in computers are binary.
    Protocols for use with distributed data, like HTML, ensure that the binary
    data look the same across platforms. But when using a C/C++ compiler this
    is not so: data types can be padded in most unexpected manners, and there
    is no good way to tell how this padding is done except by studying each
    compiler explicitly. So one cannot just write a C/C++ data structure to a
    file, transport it over the Internet via a protocol, and expect it to make
    sense for the same program compiled with another compiler.
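
    For example (a minimal sketch with a made-up struct): the layout below is
    not a portable exchange format, because the padding between the members
    and the byte order of the integer are both up to the implementation.

      #include <cstdio>

      struct Record {
          char tag;     /* 1 byte                                       */
          long value;   /* padding before this member and its byte order
                           are implementation-defined                   */
      };

      int main()
      {
          /* sizeof(Record) commonly differs between compilers and
             targets, so fwrite()-ing the struct directly does not give
             an interchangeable format; each field has to be serialized
             explicitly, byte by byte. */
          std::printf("sizeof(Record) = %lu\n",
                      (unsigned long)sizeof(Record));
          return 0;
      }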

    Unicode is a protocol for distributed data: one expects the code points to
    mean the same things everywhere. But with the "Unicode" \u... construct of
    C++, one does not know anything like that; it may not produce anything
    sensible at all.
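
    To spell that out (a sketch of standard C++ behaviour): a universal
    character name in a character literal is mapped into the execution
    character set, so the stored value is implementation-defined and is not
    guaranteed to be the Unicode code point.

      /* The value of c depends on the compiler's execution character
         set; it is NOT guaranteed to be 0xC0, and the character may not
         even be representable in a plain char. */
      const char c = '\u00C0';

      /* The wide form is no better defined: the value depends on the
         implementation's wide execution character set. */
      const wchar_t w = L'\u00C0';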

    >This is not to lessen the need for binary exchangeable data. In fact,
    >ISO/IEC 10646 initially established an unambiguous scheme for data (that
    >is, network order). Practice has shown it was not adequate (Intel's sales
    >numbers might be one reason; waste of storage space, another). My guess
    >is that the opposite move, forcing little endian everywhere on the basis
    >that most cores are set up this way today, won't be correct either,
    >partly because it is uncomfortable for us humans.

    In <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, it says that big endian
    is recommended in UTF-32. My guess is that until there is consensus at the
    CPU level, both endiannesses will persist.
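
    In practice, one avoids the question by serializing explicitly (a minimal
    sketch, assuming the output is to be UTF-32BE regardless of the host CPU):

      #include <cstdio>

      /* Write one code point as UTF-32BE, independent of the host byte
         order; no range checking or error handling here. */
      void put_utf32be(unsigned long cp, std::FILE* out)
      {
          std::putc((int)((cp >> 24) & 0xFF), out);
          std::putc((int)((cp >> 16) & 0xFF), out);
          std::putc((int)((cp >>  8) & 0xFF), out);
          std::putc((int)( cp        & 0xFF), out);
      }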

    >> People writing WWW-browsers and the like say it is a pain.
    >
    >I fail to see the point (why a browser should use \u?). Can you give an
    >example of what you mean?

    There was a guy, a few years ago, who gave an example. In order to make
    sure the software came out right with every compiler, he had to write,
    say:
      #define UCHAR ... /* get hold of a 32-bit integral type */

      const UCHAR LATIN_CAPITAL_LETTER_A_WITH_GRAVE = 0x000C0;
      ...
    Then use LATIN_CAPITAL_LETTER_A_WITH_GRAVE instead of \u00C0.

    I do not recall, though, what kind of browser or text processing software
    he was writing.
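
    For comparison, the same idea with the C99 <stdint.h> types (a sketch,
    assuming the compiler ships that header, which is not yet universal) takes
    the guesswork out of the typedef:

      #include <stdint.h>   /* C99; many C++ compilers provide it too */

      typedef uint_least32_t UCHAR;   /* at least 32 bits, guaranteed */

      const UCHAR LATIN_CAPITAL_LETTER_A_WITH_GRAVE = 0x00C0;

      /* The named constant is used in place of the \u00C0 escape, so the
         stored value is the Unicode code point itself, not whatever the
         compiler's execution character set maps \u00C0 to. */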

      Hans Aberg


