A big extra cost of virtual functions in the underlying CPU not mentioned in the article: they effectively create a branch-target dependency on a pointer chase. Put another way:<p>1) The virtual function address lookup requires a load from an address which is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses. Even at best, it incurs two L1 cache hits, which is about 8-16 cycles on modern architectures.<p>2) The function call itself is dependent on the final address loaded above. None of it can proceed until the branch target is known. If correctly predicted, all is good and the core speculatively executes a large number of instructions. Best case, the core may still stall shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.<p>In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.
I worked on serious x86 clone once - we took a lot of real-world trace and ran it through our various microarchitectures to see how it would fly - dynamic C++ dispatch was interesting normally you expect something like<p><pre><code> mov r1, n(bp) ; get vtable
 mov r2, n(r1) ; get method pointer
call (r2) ; call
</code></pre>
that's a really bad pipe break a double indirect load and a call - but branch prediction may be your friend ...<p>However some of the code we saw (I think it came from a Borland compiler)<p><pre><code> mov r1, n(bp) ; get vtable
 push n(r1) ; get method pointer
ret ; call
</code></pre>
an extra memory write/read, but it's always caught in L1, and on the register-poor x86 it saves a register, right? ... but on most CPUs of the time you're screwed for branch prediction - CPUs had a return cache, a cheap way to predict the branch target of a return - by doing a return without a call you've popped the return cache, leaving it in a bad state - EVERY return in an enclosing method is going to mispredict as well - the code will run, but slowly
<p><pre><code> for (unsigned i = 0; i < N; ++i) {
for (unsigned j = 0; j < i; ++j) {
obj->tick(j);
}
}
</code></pre>
I wouldn't go quite so far as to say that benchmarks with tight inner loops like this are <i>completely</i> useless, but they are nearly so.<p>The author is clearly aware that the real world of performance is much bigger & more complex than his simple Petri dish. Credit to him for mentioning that. It's also really refreshing to see him analysing the optimised assembly.<p>The trouble with this approach is that it's tempting to draw simple conclusions. In this case, you might be tempted to conclude "CRTP is always faster than virtual dispatch", when the truth is likely to be much more situation-dependent.<p>I have seen a biggish project go through a lot of effort to switch to CRTP, only to see a negligible performance impact.
"If anything doesn’t feel right, or just to make (3) more careful, use low-level counters to make sure that the amount of instructions executed and other such details makes sense given (2)."<p>This is explicit support for confirmation bias.<p>See Feynman's discussion of measuring the charge of the electron in Cargo Cult Science:<p>"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that..."<p><a href="http://neurotheory.columbia.edu/~ken/cargo_cult.html" rel="nofollow">http://neurotheory.columbia.edu/~ken/cargo_cult.html</a>
When you think you can use CRTP instead of virtual dispatch in your program, you didn't need virtual dispatch to begin with... you needed a generic algorithm to operate over your object classes. That's exactly what run_crtp() is; the CRTPInterface class is completely redundant, except that it provides some degree of compile-time concept checking (which we'll hopefully get in C++17).<p>Virtual dispatch is useful for type erasure, when using abstract types from plugins, DLLs or generally "somebody else's code". IMHO, the valid use cases within a standalone program are actually fairly small.
I've done benchmarks on this fairly recently, and with the functions actually doing a lot of work (ray intersection for a raytracer), I saw practically no difference between CRTP and Virtual Functions:<p><a href="http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-overhead.html" rel="nofollow">http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-...</a><p>And this was with billions of calls to the functions...
So he found that dynamic dispatch was a lot more expensive. Fair enough and not very surprising. But let's quantify it a bit in absolute terms. The dynamic version of the code took 1.25s to run, during which time it performed approximately 8 x 10^8 virtual function calls. That translates to a cost per call of <i>1.5 nanoseconds</i>.<p>From which my takeaway would be: In inner-loopy code for which an extra nanosecond or so per call is critical, you should avoid virtual function calls. For anything else, don't worry about it.
Instead of devirtualization, a simpler optimization, which would also help in the dynamic case, is simple loop hoisting of the method-pointer fetch. Instead of doing<p><pre><code> while(...) {
(obj->vtable[0])(...)
}
</code></pre>
we could have<p><pre><code> void(*fn)(...) = obj->vtable[0]
while(...) {
fn(...)
}
</code></pre>
which would avoid two indirections per inner-loop iteration! Actually, I'm almost sure that is what LuaJIT would do, and many other high-level language implementations could perform this optimization as well. However, maybe C is too low-level to be able to do that, and I don't know about C++.
This is a nice article and props for including and dissecting generated assembler!<p>A key thing here is that inlining is what enables zero-cost abstractions in C++. A virtual call is slower than a regular call, but the main problem is that it builds a barrier that effectively stops inlining.<p>It'll be interesting to see how devirtualization in GCC will do for real world programs.
Observation: the intricacies of our technologies are growing to such complexity that analysing things we once had a direct hand in designing now plays out much like the analysis of some mysterious natural phenomenon.
It's interesting to see a breakdown of this - especially using modern compilers on the Intel platform.<p>Did you try the Intel compiler? For raw low-level optimisation it sometimes massively outperforms the MS, gcc or clang versions...<p>I'd imagine these problems are worse on ARM chips, and dynamic dispatch is even less effective there - certainly on PPC architectures I've seen much worse performance than on similarly powered Intels in precisely this situation. The caches are smaller and slower...<p>I'm not 100% sure, but I think I've seen virtual calls 'devirtualised' by the MS compiler a couple of years ago... I might be thinking of something else though, it was a while back now. I was unpicking some CRTP mess in something that /was not performance critical in any way/...
I'd like to see a comparison of calling a dynamically linked function call vs a non-dynamically linked virtual call.<p>Dynamic linking has more indirection than you might expect because the function addresses can't always just be put at the call site during the library load (the places where you would want to write the address can be in code that is read-only mmapped to aid in sharing memory between processes and to avoid loading unused stuff from disk).
This _could_ be another case of premature optimization, as gcc 4.9+ can automagically devirtualize non-overridden virtual functions. icc has been able to do that for years.