This is a great master class, if you will, on iterative optimization of core loops. They progressively improve their method until it approaches the performance of standard, optimized BLAS implementations. (It ends up about as fast as BLIS, but slower than Intel's MKL, which is about as fast as OpenBLAS. [0]) Timing Eigen without linking against BLAS is a little misleading, since Eigen can be built to use a system BLAS.<p>You wouldn't want to use this code, but it shows you the sorts of things to start paying attention to in performance-critical sections. I was most surprised by the fact that reordering operations to spread the same instructions apart made a significant difference.<p>(As an aside, your best bet in practical tools is to use a metaprogramming library [Blaze seems to be the best] that wraps core operations in a fast BLAS implementation. I personally choose to use Blaze on top of OpenBLAS.)<p>[0] <a href="https://news.ycombinator.com/item?id=10114830" rel="nofollow">https://news.ycombinator.com/item?id=10114830</a>
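<p>To make the point about spreading the same instructions apart a bit more concrete, here is a rough sketch of the general idea. It is my own illustration in C++, not code from the article, and the names dot_single and dot_interleaved are made up for the example. The assumption is that the gain comes from breaking one long dependency chain into several independent ones, so the CPU can overlap them; whether it actually helps depends on the compiler and the hardware.

```cpp
#include <cstddef>
#include <vector>

// Baseline: one accumulator, so every addition has to wait for the
// previous one to finish (a single long dependency chain).
// Assumes a and b have the same length.
double dot_single(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Same computation with four independent accumulators. Consecutive
// multiply-adds no longer depend on each other, which gives the CPU
// room to keep several of them in flight at once.
// Assumes a and b have the same length.
double dot_interleaved(const std::vector<double>& a, const std::vector<double>& b) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add the terms in a different order, so with general floating-point inputs the results can differ in the last few bits.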