The article doesn't really answer the question its title poses.

Of course there are the well-known reasons: the nonlinearity of power vs. frequency scaling, diminishing returns in hardware design, and so on. But there are others we don't hear so much about.

Hardware design is still at a pretty nascent stage, technology-wise. The languages used (say, SystemC or Verilog) offer very little high-level abstraction, and the simulation tools suck. Sections of the CPU are still typically designed in isolation, in an ad-hoc way, using barely any measurements, and rarely on anything more than a few small kernels. Excel is about the most statistically advanced tool used in the process. Of course, CPUs are hugely intertwined and complicated beasts, and the optimal values of parameters such as register file sizes, number of reservation stations, cache latency, decode width, whatever, are all interconnected. As long as design teams only focus on their own little portion of the chip, without any overarching goal of global optimization, we're leaving a ton of performance on the table (a toy sketch of this point is at the end of this comment).

And for that matter, so is software/compiler design. The software people have just been treating hardware as a fixed target they have no control over, trusting that it will keep improving. That makes us lazy, and our software becomes slower and slower, by design (The Great Moore's Law Compensator if you will, also known as https://en.wikipedia.org/wiki/Wirth%27s_law).

The same problems we see in hardware design, huge numbers of deeply intertwined parameters, also apply to software/compiler design. We're still writing performance code in C++, for chrissakes. And even beyond that, the parameters in software and hardware are deeply intertwined with each other. To optimize hardware parameters, you need lots of measurements of representative software workloads. But where do those come from, and how are they compiled? Compiler writers have the liberty to change the way code is compiled to optimize performance on a specific chip (even if this isn't done much in practice). To get an actually representative measurement of the hardware, those compiler changes need to be taken into account. Ideally, you'd be able to tune parameters at every layer of the stack and design software and hardware together as one entity: make a hardware change, then make lots of compiler changes to optimize for that particular hardware instantiation (a second sketch below shows the shape of such a loop). This needs to be automated, easy to extend, and super-duper fast, to try the zillions of possibilities we're not touching at the moment. There are even "crazy" possibilities like moving functionality across the hardware/software barrier. Of course it's a difficult problem, but we've made almost zero progress on it.

Backwards compatibility is another reason. New instructions get added regularly, but only for cases where big gains are achieved on important workloads. For the most part, CPU designers want improvements that work without a recompile, because that's what most businesses and consumers want. One can envision a software ecosystem in which this wouldn't be such a problem, but instead we have people still running IE6/WinXP/etc. Software can move at a glacial pace, and hardware has to accommodate it. But that, of course, also enables the awfully slow pace of software progress.
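
To make the "interconnected parameters" point concrete, here's a toy sketch in Python. Everything in it (the parameter ranges, the little analytic performance model) is made up purely for illustration; in a real flow the scores would come from cycle-accurate simulation of representative workloads. The point is only structural: tuning each knob in isolation around a default configuration explores a handful of points, while a joint search over the same (tiny) space sees combinations the per-team approach never looks at.

    # Toy sketch: per-team, one-knob-at-a-time tuning vs. a joint search.
    # The analytic "performance model" below is invented for illustration only.
    import itertools

    DECODE_WIDTHS = [2, 4, 6, 8]
    ROB_SIZES     = [64, 128, 192, 256]   # out-of-order window entries
    L1_LATENCIES  = [2, 3, 4, 5]          # load-to-use latency, cycles

    DEFAULTS = (4, 128, 3)

    def perf(width, rob, lat):
        # Toy model: sustained IPC is limited by whichever of the front end
        # or the out-of-order window is the bottleneck...
        ipc = min(width, rob / (8 * lat))
        # ...and bigger structures cost clock frequency.
        freq = 5.0 / (1 + 0.05 * width + 0.002 * rob)
        return ipc * freq

    # "Each team tunes its own knob", with every other knob held at its default.
    w0, r0, l0 = DEFAULTS
    local = (max(DECODE_WIDTHS, key=lambda w: perf(w, r0, l0)),
             max(ROB_SIZES,     key=lambda r: perf(w0, r, l0)),
             max(L1_LATENCIES,  key=lambda l: perf(w0, r0, l)))

    # Joint search over the whole (tiny) design space.
    joint = max(itertools.product(DECODE_WIDTHS, ROB_SIZES, L1_LATENCIES),
                key=lambda p: perf(*p))

    print("per-knob tuning:", local, "->", round(perf(*local), 2))
    print("joint search   :", joint, "->", round(perf(*joint), 2))

In this made-up model, widening decode alone doesn't pay (the window is the bottleneck) and enlarging the window alone doesn't pay (decode is the bottleneck), so the per-knob answer lands somewhere the joint search wouldn't.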
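
And here's an equally toy sketch of the hardware/compiler co-tuning loop described above. Every interface in it is hypothetical: compile_and_run() stands in for "compile this workload with these compiler knobs, run it on a simulation of this candidate core, report a score", which in reality would mean driving a real compiler's flags and cost-model knobs plus a cycle-accurate or FPGA-level model of the core. The structure is the point: the compiler gets re-tuned for each hardware candidate before that candidate is scored.

    # Toy sketch of automated hardware/compiler co-tuning. All interfaces,
    # parameter spaces, and scores here are hypothetical stand-ins.
    import itertools

    HW_SPACE = list(itertools.product([4, 6, 8],          # decode width
                                      [128, 192, 256]))   # ROB entries

    CC_SPACE = list(itertools.product([1, 2, 4],          # unroll factor
                                      [4, 8, 16]))        # vector width to target

    WORKLOADS = ["spmv", "json_parse", "raytrace"]        # placeholder benchmark names

    def compile_and_run(workload, cc, hw):
        """Stand-in for: compile the workload with these compiler knobs, run it
        on a simulation of this core, return a performance score. The formula is
        made up, but it bakes in cross-layer interactions (e.g. wide vectors only
        help wide cores)."""
        unroll, vec = cc
        width, rob = hw
        score = min(width, vec / 2) + rob / 128.0
        if workload == "spmv":
            score -= 0.1 * unroll   # toy assumption: unrolling hurts the irregular kernel
        return score

    def score_core(hw):
        # The key move: re-tune the compiler for *each* hardware candidate, so
        # every core variant is judged on code actually optimized for it, not on
        # binaries tuned for last year's chip.
        total = 0.0
        for wl in WORKLOADS:
            best_cc = max(CC_SPACE, key=lambda cc: compile_and_run(wl, cc, hw))
            total += compile_and_run(wl, best_cc, hw)
        return total

    best = max(HW_SPACE, key=score_core)
    print("best co-tuned core candidate:", best, "score:", round(score_core(best), 2))

Swap the made-up scoring for real compile-and-simulate runs and the inner loop becomes the expensive part, which is exactly why this only works if the whole pipeline is automated and fast.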