Function calls aren't that slow on an out-of-order (OoO) processor. A direct call is a perfectly predictable branch, so the front end can just start fetching and decoding at the target. The callee might miss in the instruction cache, but leaving it out of line also keeps the code smaller, which can mean fewer cache misses overall, or even better, the CPU might skip decoding entirely by hitting in the µop cache.<p>Really, the purpose of inlining is so inlined functions can be specialized for their new context, which can easily make the total code size smaller. On x86, size/speed tradeoffs just don't happen like they used to.
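<p>To make the specialization point concrete, here's a rough sketch in C. Everything in it (`apply`, `enum op`, `sum_doubled`) is a made-up name for illustration, not from any real codebase. The helper dispatches on a runtime argument, but the caller passes constants, so once the helper is inlined an optimizing compiler can usually fold the switch away and leave just a multiply (or a shift/add) in the loop. What you actually get depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical general-purpose helper: dispatches on 'op' at run time. */
enum op { OP_ADD, OP_SUB, OP_MUL };

static inline int apply(enum op op, int x, int k)
{
    switch (op) {
    case OP_ADD: return x + k;
    case OP_SUB: return x - k;
    case OP_MUL: return x * k;
    }
    assert(!"unknown op"); /* not reached for valid enum values */
    return 0;
}

/* The caller passes constants for 'op' and 'k'. If apply() is inlined
 * here, constant propagation can typically delete the switch and the
 * unused cases, so the specialized loop body may end up smaller than
 * a call sequence plus the out-of-line general function. */
int sum_doubled(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += apply(OP_MUL, v[i], 2);
    return total;
}
```

If you want to check this on your own setup, comparing the assembly from something like `-O2 -S` with and without `-fno-inline` (GCC and Clang both accept it) should show whether the switch survives.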