There are some issues that make the actual C overhead somewhat greater than you measured. It's still far in the lead, but not by quite as much as your numbers imply.<p>The biggest problem is that modern GCC is smart enough to optimize away much of your recursion. If you compile with full optimization as you did and then inspect the 'recursor' function in your executable with 'objdump -d', you'll see that it's essentially unrolled your function 3 times. Instead of actually calling the function, it's just doing a comparison two times out of 3.<p>The best way to defeat this and level the playing ground is probably to add "-fno-inline-functions" to the command line. When you do this, the program may well segfault rather than run because it's suddenly using 3 times the stack space. At least, that's what happened when I tried it. Depending on how your system is set up, you might be able to solve this with 'ulimit -s unlimited'.<p><pre><code> # Linux 3.11.0 x86_64 on Intel i7-4770
$ ulimit -s unlimited
$ gcc-4.8 -march=native -Wall -O3 test-function-call-overhead.c -o overhead-original
$ overhead-original
test-function-call-overhead.c: 1917046018.4902 calls per second
$ gcc-4.8 -fno-inline-functions -march=native -Wall -O3 test-function-call-overhead.c -o overhead-no-inline
$ overhead-no-inline
test-function-call-overhead.c: 402246812.8280 calls per second
</code></pre>
This would imply that our actual loop overhead for C is 5x greater than your numbers. But even this number isn't really accurate. If you run 'perf stat overhead-no-inline', you see that it is executing well under one instruction per cycle:<p><pre><code> $ perf stat overhead-no-inline
6,517,046,940 instructions # 0.62 insns per cycle
</code></pre>
There are a number of things that can cause this, but the most likely is that the run time is being limited by memory latency rather what one would think of as normal function overhead. What memory you might ask? Remember that bit about needing to adjust the stack size because we were exceeding the default? It turns out that the default 8 MB is about equal to the L3 cache on modern processors, and thus the stack is overflowing to the much slower RAM.<p>To get a better measurement, we need to fiddle with your numbers for N and K. Each stack frame is probably about 32 bytes (I fiddled with your code to figure out what was happening and don't know the exact original), so if you want the stack to fit in the 32 KB L1, you'd be limited to about 1000 levels of recursion. But this is too fast for gettimeofday() to measure accurately in microseconds -- you'd want to switch to clock_gettime() which goes down to nanoseconds.<p>But that was too much for me tonight, so I'll stop here. The final answer will depend on the exact structure of your loop, but should be somewhere in between the two numbers at about about 1 billion calls per second, which is to say about 3-4 cycles per function call, with the much of the overhead coming from the loop rather than the call overhead.