Re: (Informational only: UTF-8 BOM and the real life)

From: Steven Atreju <>
Date: Mon, 30 Jul 2012 13:34:23 +0200

"Doug Ewell" <> wrote:

 |Steven Atreju wrote:
 |^Z as an EOF marker for text files was part of the MS-DOS legacy from
 |CP/M, where all files were written to a multiple of the disk block size
 |(I think 128 for CP/M and 512 for MS-DOS 1.x), and there had to be some
 |way to tell where the real text content ended. New stream-based I/O
 |calls in MS-DOS 2.0 made this mechanism unnecessary. Unix systems had no
 |legacy from CP/M, so they never had this problem.
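
To make the quoted mechanism concrete, here is a minimal Python
sketch of block padding with ^Z. It is only an illustration under
assumptions: the 128-byte block size is taken from the quote above,
and the helper names pad_to_blocks and real_text are made up here,
not taken from any real CP/M or MS-DOS code.

```python
CTRL_Z = b"\x1a"   # ^Z, ASCII SUB
BLOCK = 128        # assumed block size, per the quoted "128 for CP/M"

def pad_to_blocks(text: bytes, block: int = BLOCK) -> bytes:
    """Pad text with ^Z up to the next multiple of the block size."""
    remainder = len(text) % block
    if remainder == 0:
        return text  # already fills whole blocks, nothing to pad
    return text + CTRL_Z * (block - remainder)

def real_text(stored: bytes) -> bytes:
    """Return the content up to (not including) the first ^Z."""
    end = stored.find(CTRL_Z)
    return stored if end == -1 else stored[:end]

stored = pad_to_blocks(b"hello\r\n")
print(len(stored))          # 128
print(real_text(stored))    # b'hello\r\n'
```

A reader that only knows the file length in whole blocks has to
scan for the first ^Z in this way to find the end of the text.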

I'm learning in this thread.
(And wasn't CP/M that thing that Microsoft bought cheap to sell it
expensively the very next day to IBM as their consumer box OS?!
Well, money must be made and sometimes you have to break an egg
to make an omelette. Sure thing. Providence really matters.)

 |> I.e., this is why we do have this messy text OR binary file I/O
 |> distinction like O_BINARY (for open(2)), "b" (for fopen(3)) or
 |> binmode (perl(1)). Because without those a text file will see
 |> End-Of-File at the ^Z, not at the real end of the file.
 |The reason for the text/binary distinction on DOS and Windows is
 |conversion between Unix-standard LF and Windows (DOS, CP/M)-standard

Eh, no, here you are mistaken, I think. Line endings are a
different problem. There may be I/O libraries which take this
flag into account even for those, but I've not seen such an
approach yet. It would seem dangerous to me if there were.
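
As one data point on libraries that tie line-ending handling to
the text/binary flag: Python's built-in open() translates CRLF to
LF when reading in text mode with its default settings, and leaves
the bytes alone in binary mode. A small sketch follows; the helper
name read_both_ways is made up for illustration.

```python
import os
import tempfile

def read_both_ways(raw: bytes):
    """Write raw bytes to a temp file, read it back in text and binary mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with open(path, "r", encoding="ascii") as f:   # text mode
            as_text = f.read()
        with open(path, "rb") as f:                    # binary mode
            as_bytes = f.read()
    finally:
        os.remove(path)
    return as_text, as_bytes

text, data = read_both_ways(b"one\r\ntwo\r\n")
print(repr(text))   # 'one\ntwo\n'
print(repr(data))   # b'one\r\ntwo\r\n'
```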

(The perfect approach to handle the newline problem is somewhat
costly at runtime. But this is good for the power industry and
the hardware producers, isn't it? I remember a tree implementation
test comparison in the German computer magazine c't about a decade
ago. It compared a C++ and a Java version of the very same program,
and the Java version was faster than the C++ version that stored
full datatype instances in a Node<datatype> template, due to the
memory allocator! Microsoft and Intel still had that «Wintel»
alliance back then. I think that was the in-between-the-lines
tenor, if I recall correctly.
But I'm slowly running out of anti-Microsoftisms in this thread.)

 |CRLF. It might be true that library calls to read a file in text mode
 |will stop at ^Z, but Notepad and Wordpad don't. I know the library
 |doesn't automatically write ^Z. Almost nobody in the MS world uses the
 |^Z convention on purpose any more; many don't even know about it.
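
For reference, the text-mode behaviour discussed in the quotes can
be emulated in a few lines of Python. This is a sketch of the
described semantics only (stop reading at the first ^Z, translate
CRLF to LF on input, write LF as CRLF without appending ^Z), not
the code of any actual C library; the function names are made up.

```python
CTRL_Z = b"\x1a"  # ^Z, ASCII SUB

def text_mode_read(raw: bytes) -> bytes:
    """Emulated text-mode read: cut at the first ^Z, then CRLF -> LF."""
    end = raw.find(CTRL_Z)
    if end != -1:
        raw = raw[:end]
    return raw.replace(b"\r\n", b"\n")

def text_mode_write(data: bytes) -> bytes:
    """Emulated text-mode write: LF -> CRLF, and no ^Z is appended."""
    return data.replace(b"\n", b"\r\n")

print(text_mode_read(b"a\r\nb\r\n\x1ahidden"))  # b'a\nb\n'
print(text_mode_write(b"a\nb\n"))               # b'a\r\nb\r\n'
```

Reading the same bytes in binary mode would simply return them
unchanged, including the ^Z and everything after it.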

I've only seen the Cygwin *code* (very well over a decade ago).
(Well, those were the I/O streams. You really wouldn't have
wanted to see what was necessary for select(2)..
An operating system without select(2) is simply not imaginable.)

 |> (Which raises the immediate question why the Microsoft programmers did
 |> not embed the meta information in this section at the end of the file.
 |> But I don't really want to know.)
 |See above. The intent of ^Z was never to distinguish data from metadata,
 |as with the Mac data and resource forks.
 |But of course none of this has anything to do with U+FEFF.

Not so.

 |> So do the programmers have to face the same conditions? I don't
 |> really think so. They prefer driving plain text readers up the wall.
 |> Successfully.

This seems to have lost its context..

 |Again, we don't really have this kind of evil intent, though it's often
 |fun and convenient for people to imagine we do.

.. hmmmmmm ...

 |Doug Ewell | Thornton, Colorado, USA
 | | @DougEwell

Received on Mon Jul 30 2012 - 06:37:35 CDT
