RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (
Date: Fri Jan 21 2005 - 09:54:41 CST


    Antoine Leca wrote:
    > Lars Kristan wrote:
    > > * On UNIX, fopen is always binary, text mode is rare and even then
    > > very close to binary.
    > You have it reversed. According to the Standards (and it was a decision
    > of the C Standard to make it this way; there actually were previous
    > usages of the reverse convention, such as the "wt" you can find with
    > MS-DOS compilers), fopen is normally text ("w"); binary mode ("wb") is
    > rare and even then identical to text.

    I did not have it reversed. But maybe I was a bit too terse; sorry, I am
    _trying_ to keep things short, but with such complex issues that is not
    always possible. Anyway, here is what I meant:

    Explanation of the "On UNIX, fopen is always binary" part:

    UNIX opens files in binary mode. No bytes are interpreted, dropped, changed,
    or added, and seeks are simple and efficient. From the standard's
    perspective you could say this is text mode, as that is what is specified,
    but I insist that from the user's perspective this is binary mode. All UNIX
    does to satisfy the standard is IGNORE the 'b' part of the type parameter.
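
    A minimal sketch of how one could check this claim, assuming a POSIX
    system. The helper names roundtrip and modes_are_identical, the sample
    bytes, and the scratch path are hypothetical and chosen only for
    illustration. It writes the same bytes once with "w" and once with "wb",
    reads them back, and compares them with what was written:

```c
#include <stdio.h>
#include <string.h>

/* Write `len` bytes of `data` using `write_mode`, then read the file back
   with "rb". Returns the number of bytes read into `out`, or -1 on error. */
long roundtrip(const char *path, const char *write_mode,
               const char *data, size_t len,
               unsigned char *out, size_t cap)
{
    FILE *f = fopen(path, write_mode);
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, len, f);
    if (fclose(f) != 0 || written != len)
        return -1;

    f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(out, 1, cap, f);
    fclose(f);
    return (long)n;
}

/* Returns 1 if both "w" and "wb" leave the sample bytes untouched on disk. */
int modes_are_identical(const char *path)
{
    /* CR LF, a bare LF, Ctrl-Z and a 0xFF byte: the usual suspects that a
       translating text mode would alter or stop at. */
    static const char sample[] = "one\r\ntwo\n\x1a\xff";
    size_t len = sizeof sample - 1;
    unsigned char a[64], b[64];

    long na = roundtrip(path, "w", sample, len, a, sizeof a);
    long nb = roundtrip(path, "wb", sample, len, b, sizeof b);
    remove(path);

    return na == (long)len && nb == (long)len &&
           memcmp(a, sample, len) == 0 &&
           memcmp(b, sample, len) == 0;
}
```

    On a POSIX system modes_are_identical("/tmp/fopen_mode_demo.tmp") should
    return 1. On a runtime whose text mode does translate line ends, the "w"
    round trip would be expected to come back with extra bytes, so the check
    would return 0 there.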

    Explanation of the "text mode is rare" part:

    By text mode I was no longer referring to fopen. It goes with the
    corresponding line for Windows, which was:
    * On Windows, fopen has a text mode, programs have a /b switch.

    So the text mode I was referring to is in the programs, not in the system
    or run-time libraries. An example is ftp (remember BIN?).
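
    To illustrate what "text mode in the program" means, here is a rough
    sketch of the kind of line-end translation an ASCII-mode transfer performs
    on the sending side, as opposed to BIN, which sends the bytes unchanged.
    The function name lf_to_crlf and its error convention are hypothetical;
    this is not the code of any real ftp client.

```c
#include <stddef.h>

/* Copy `n` bytes from `in` to `out`, inserting CR before every LF that is
   not already preceded by CR. All other bytes pass through unchanged.
   Returns the number of bytes written, or (size_t)-1 if `out` is too small. */
size_t lf_to_crlf(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        int bare_lf = in[i] == '\n' && (i == 0 || in[i - 1] != '\r');
        size_t need = bare_lf ? 2 : 1;
        if (o + need > cap)
            return (size_t)-1;
        if (bare_lf)
            out[o++] = '\r';
        out[o++] = in[i];
    }
    return o;
}
```

    Note that only LF is interpreted here. A byte such as 0xFF is copied as
    is, which is the "very close to binary" behaviour described below.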

    I wrote "and even then very close to binary", and meant:

    Although some programs do interpret streams as text, they often interpret
    very few characters, for example CR, LF, space, and delimiters. Even if a
    stream contains byte values (or sequences) that have no representation in
    the current locale, they get through. Either they are not processed at all
    and are just passed on, or they are even processed meaningfully, for
    example counted as part of words in a word count.
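
    A small sketch of such a program, assuming a byte-oriented word counter
    that treats only space, tab, CR and LF as delimiters. The names is_delim
    and count_words are hypothetical, and real tools may consult the locale
    for their delimiter set.

```c
#include <stddef.h>

/* The only bytes this counter interprets. */
static int is_delim(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Count runs of non-delimiter bytes. Bytes that are not valid in the
   current locale (or not valid UTF-8) are simply treated as word content. */
size_t count_words(const unsigned char *buf, size_t n)
{
    size_t words = 0;
    int in_word = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_delim(buf[i])) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

    For the 11 bytes of "abc \xff\xfe def\n" this returns 3: the two bytes
    0xFF 0xFE are not valid UTF-8, yet they are counted as a word of their
    own instead of being rejected or dropped.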

    BTW (yes, again and again): This is something Windows is not able to
    achieve. But that does not mean no Unicode application is able to do it.
    An application that processes text in UTF-8 is also able to do it. UTF-16
    applications, on the other hand, are not.

    > This does not change anything to your point, which still holds.


