Fun tangential anecdote regarding how interconnected and unintuitive CPU performance can be: I once made something run 20% faster by spawning a thread that did nothing but spin (i.e. <i>while (true);</i>).<p>I was trying to optimize some FEM code, toying with (hardcoded) solver parameters. On one console I had it spitting out the wall clock durations of time steps as the simulation was running, while on the other I was preparing the next run. I start compiling another version, and inexplicably the simulation in the other console gets <i>faster</i>. Like, 10%-20% less time taken per time step. "That must have been a coincidence. There's <i>no way</i> the simulation got faster because something was compiling in parallel." But curiosity got the better of me and I investigated anyway.<p>Watching the CPU speed with CPU-Z, it turned out that the CPU was indeed down-clocking during the simulation, and that compiling something in parallel made it run faster, speeding up the simulation too. WTF? And indeed, I could make the entire simulation run significantly faster by calling<p><pre><code> std::thread([]{ while (true); }).detach();
</code></pre>
at the start of main.<p>Why? Well, the simulation happens to be extremely memory-bound (sparse mat-vec multiplication in the inner loop). So the CPU is mostly waiting around for data to arrive, and apparently it downclocks as a result. That would be fine, if not for the fact that the uncore/memory subsystem clock speed is <i>directly tied to the current CPU speed</i>. That's right: the program was memory-bound, hence the CPU clocked down, hence the uncore clocked down, hence memory accesses became slower.<p>Knowing that feedback loop, it makes perfect sense that keeping the CPU busy with a spinning thread improves performance. But it's still one big wtf.<p>This problem eventually went away as we parallelized more and more of the simulation, giving the CPU less reason to clock down. But for related reasons, the simulation still runs faster if you prevent hyperthreading (either by disabling it in the BIOS or setting num threads = num hardware cores). More threads don't improve memory bandwidth, and the hyperthread pairs just step on each other's toes.
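<p>For the curious, here's a minimal, self-contained sketch of the trick (my own reconstruction, not the original FEM code): a detached spinner thread keeps one core busy while the main thread runs a deliberately memory-bound workload. Note the <i>.detach()</i> — destroying a joinable <i>std::thread</i> calls <i>std::terminate()</i>. Whether the spinner actually helps depends entirely on your CPU's clock-management behavior; the array size and stride below are arbitrary choices meant to blow out the last-level cache.<p><pre><code>#include &lt;chrono&gt;
#include &lt;cstdint&gt;
#include &lt;cstdio&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

int main() {
    // The spinner must be detached: a joinable std::thread whose
    // destructor runs would terminate the process. The volatile read
    // keeps the loop from being optimized away (a side-effect-free
    // infinite loop is undefined behavior in C++).
    std::thread([] { for (volatile bool run = true; run;) {} }).detach();

    // A deliberately memory-bound workload: strided traversal of a
    // buffer larger than a typical last-level cache (32 MiB here),
    // touching roughly one 64-byte cache line per access.
    std::vector&lt;std::uint64_t&gt; buf(1 &lt;&lt; 22, 1);
    std::uint64_t sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass &lt; 4; ++pass)
        for (std::size_t i = 0; i &lt; buf.size(); i += 8)
            sum += buf[i];
    auto ms = std::chrono::duration_cast&lt;std::chrono::milliseconds&gt;(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("sum=%llu time=%lldms\n",
                (unsigned long long)sum, (long long)ms);
    return 0;  // the detached spinner dies with the process
}
</code></pre>
<p>Compare the printed time with and without the spinner line to see whether your machine exhibits the same uncore feedback loop.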