RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan ([email protected])
Date: Fri Jan 21 2005 - 09:54:41 CST

Next message: Peter Kirk: "Re: So how about U+D7FD for a NOP then?"

Previous message: Hans Aberg: "Byte-oriented lexer generator for Unicode"
Maybe in reply to: Lars Kristan: "UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"
Next in thread: Andy Heninger: "Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"
Reply: Andy Heninger: "Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"
Reply: Antoine Leca: "Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"

Antoine Leca wrote:
> Lars Kristan wrote:
> > * On UNIX, fopen is always binary, text mode is rare and even then
> > very close to binary.
>
> You have it reversed. According to the Standards (and it was a decision
> of the C Standard to make it this way, there actually was previous usages
> of the reverse convention such as "wt" you can find with MS-DOS
> compilers), fopen is normally text ("w"), binary mode ("wb") is rare and
> even then identical to text.

I did not have it reversed. But maybe I was a bit too terse, sorry. I am
_trying_ to keep things short, but with such complex issues that is not
always possible. Anyway, here is what I meant:

Explanation of the "On UNIX, fopen is always binary" part:

UNIX opens files in binary mode. No bytes are interpreted, dropped, changed,
or added, and seeks are simple and efficient. From the standard's
perspective you could say this is text mode, as that is indeed what is
specified, but I insist that, from the user's perspective, this is binary
mode. All UNIX does to satisfy the standard is IGNORE the 'b' part of the
type parameter.
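
A minimal sketch of the point above, assuming a POSIX system. The function
names and the temporary file names are made up for illustration. It writes
the same bytes once with "w" and once with "wb", reads both files back, and
checks that nothing was translated:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write `len` bytes to `path` using the given fopen mode, then read the
   file back in "rb" mode into `out`. Returns the number of bytes read,
   or -1 on error. The temporary file is removed afterwards. */
static long round_trip(const char *path, const char *mode,
                       const unsigned char *data, size_t len,
                       unsigned char *out, size_t out_cap)
{
    assert(path != NULL && mode != NULL && data != NULL && out != NULL);

    FILE *f = fopen(path, mode);
    if (f == NULL)
        return -1;
    size_t written = fwrite(data, 1, len, f);
    if (fclose(f) != 0 || written != len)
        return -1;

    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(out, 1, out_cap, f);
    fclose(f);
    remove(path);
    return (long)n;
}

/* Returns 1 if "w" and "wb" both store the sample bytes unchanged,
   0 if either mode altered them, -1 on I/O error. */
int modes_are_identical(void)
{
    /* Bytes a translating text mode would typically touch: CR, LF,
       Ctrl-Z (0x1A), plus a byte that is not valid UTF-8 (0xFF). */
    static const unsigned char sample[] =
        { 'a', '\r', '\n', 0x1A, 0xFF, '\n', 'b' };
    unsigned char text_out[32], bin_out[32];

    long nt = round_trip("fopen_mode_w.tmp", "w", sample, sizeof sample,
                         text_out, sizeof text_out);
    long nb = round_trip("fopen_mode_wb.tmp", "wb", sample, sizeof sample,
                         bin_out, sizeof bin_out);
    if (nt < 0 || nb < 0)
        return -1;

    return nt == (long)sizeof sample && nb == (long)sizeof sample &&
           memcmp(text_out, sample, sizeof sample) == 0 &&
           memcmp(bin_out, sample, sizeof sample) == 0;
}
```

On a POSIX system modes_are_identical() should return 1. With a C runtime
that translates line ends in text mode, the "w" file would come back longer
and the function would return 0.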

Explanation of the "text mode is rare" part:

With "text mode" I was no longer referring to fopen. It actually goes
with the corresponding line for Windows, which was:
* On Windows, fopen has a text mode, programs have a /b switch.

So, the text mode I was referring to is in the programs, not in the system
or run time libraries. An example is in ftp (remember BIN?).

I wrote "and even then very close to binary", and meant:

Although some programs do interpret streams as text, they often interpret
very few characters, for example CR, LF, space, and delimiters. Even if a
stream contains byte values (or sequences) that have no representation in
the current locale, they get through. Either they are not processed at all
and are just passed on, or they are even processed meaningfully, for
example by being counted as part of words in a word count.
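
To illustrate "very close to binary", here is a rough sketch of a
byte-oriented word counter. The name count_words and the choice of
delimiters are my own assumptions and are not taken from any particular
tool. Only a handful of ASCII bytes are interpreted. Every other byte,
including one that is not valid in any encoding the program knows about,
simply counts as part of a word:

```c
#include <assert.h>
#include <stddef.h>

/* Count words in a byte buffer. Only ASCII whitespace is interpreted as a
   delimiter. All other byte values, valid text or not, are word content. */
size_t count_words(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t words = 0;
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int is_space = (c == ' ' || c == '\t' || c == '\n' ||
                        c == '\r' || c == '\v' || c == '\f');

        if (is_space) {
            in_word = 0;
        } else if (!in_word) {
            /* First non-delimiter byte after a delimiter starts a word. */
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

The function never decodes anything, so it cannot reject input: a stray
0xFF in the middle of a word neither splits the word nor stops the count.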

BTW (yes, again and again): This is something Windows is not able to
achieve. But that does not mean no Unicode application is able to do it.
An application that processes text in UTF-8 is also able to do it. UTF-16
applications, on the other hand, are not.

> This does not change anything to your point, which still holds.

Phew.

Lars

This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 09:59:35 CST