A big extra cost of virtual functions in the underlying CPU that the article doesn't mention: they effectively make the branch target dependent on a pointer chase. Put another way:

1) Looking up the virtual function address requires a load from an address that is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses back to back. Even in the best case, it incurs two dependent L1 accesses, which is about 8-16 cycles on modern architectures.

2) The function call itself depends on the final address loaded above; none of it can proceed until the branch target is known. If everything is cached and the indirect branch predictor guesses the target correctly, all is well and the core speculatively executes a large run of instructions. Even in that best case, the core may still stall shortly afterwards, once it runs out of instructions that don't depend on the call, until it knows for certain which address it should have branched to. Worst case, the call can't proceed until both memory accesses complete.

In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases the call can't be inlined, so simple transformations that could eliminate the function call altogether can't happen.
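
To make the pointer chase and the inlining point concrete, here's a small C++ sketch (toy types, made up for illustration); the assembly-style comments describe the typical shape of the generated code, not any particular compiler's output:

    #include <cstdio>

    // Toy types, purely for illustration.
    struct Shape {
        virtual ~Shape() = default;
        virtual int area() const = 0;
    };

    struct Square : Shape {
        int side;
        explicit Square(int s) : side(s) {}
        int area() const override { return side * side; }
    };

    // Virtual dispatch. The generated code is roughly:
    //   load  vptr   <- [s]             ; load 1: hidden vtable pointer inside the object
    //   load  target <- [vptr + slot]   ; load 2: function address, dependent on load 1
    //   call  target                    ; indirect call, dependent on load 2
    // The target isn't known until both loads complete, and area() can't be
    // inlined because the compiler doesn't know which override will run.
    int area_virtual(const Shape* s) {
        return s->area();
    }

    // Direct call: qualifying the call with the class name bypasses virtual
    // dispatch, so the target is a compile-time constant and the compiler is
    // free to inline it down to a single multiply.
    int area_direct(const Square* s) {
        return s->Square::area();
    }

    int main() {
        Square sq(5);
        std::printf("%d %d\n", area_virtual(&sq), area_direct(&sq));
        return 0;
    }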