For anyone who wants to learn about this subject in depth (it is a surprisingly interesting subject once you push past the most basic blocking and vectorization techniques), the canonical reference on memory access patterns for truly high-performance matrix multiplication is "Anatomy of High-Performance Matrix Multiplication" by Goto and van de Geijn; it's also a very readable paper.
Probably anybody who actually implements matrix multiplication knows about ATLAS already, but this software does what the article advocates, and then some, tuning itself to your specific machine's cache:
http://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software
Naive is going to be "that bad" at that size. I tried this on my own system, 4096x4096 matrix multiplication, with the naive algorithm, the naive algorithm with swapped loops, and Strassen's algorithm. I had my implementation of Strassen's algorithm switch to the naive algorithm (with bad locality of reference) for 64x64 matrices. The results were 530s for the textbook algorithm, 140s for the textbook algorithm with better cache access, and 94s for Strassen's algorithm.
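For anyone curious what the loop swap looks like, here is a minimal C sketch (variable names illustrative; both variants do the same O(n^3) work, only the memory access order differs):

    // Textbook i-j-k order: the inner loop strides down a column of B,
    // touching a new cache line on nearly every iteration for large n.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];

    // Swapped i-k-j order: the inner loop walks rows of B and C
    // sequentially, so each cache line is fully used before eviction.
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];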
As a general point, it's easy to forget all the layers.

If you learn C, you feel close to the metal. I have an array. It is contiguous bytes in RAM. I've got the power!

Well, maybe. The array might be contiguous virtual memory, but backed by disjoint blocks of physical memory. Then there's cache. In front of... another cache.

At least with C you're closER to the metal.
Just to point out the obvious: the thing to take away from this article isn't that this is the right way to do matrix multiplication (ATLAS and BLAS are), but that cache considerations can beat even complexity concerns. Bjarne Stroustrup is quite fond of demonstrating the superiority of std::vector over std::list even when complexity analysis would suggest the list is the right choice.
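A tiny C sketch of the underlying effect (illustrative only; both traversals are O(n), but one is sequential and one chases pointers):

    #include <stddef.h>

    struct node { struct node *next; long val; };

    // Sequential walk: the hardware prefetcher streams cache lines ahead.
    long sum_array(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    // Pointer chase: if the nodes are scattered across the heap, nearly
    // every step is a cache miss, despite identical asymptotic complexity.
    long sum_list(const struct node *p) {
        long s = 0;
        for (; p; p = p->next)
            s += p->val;
        return s;
    }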
Doing cache-oblivious blocking can make a huge (1000x) perf difference over naive matrix multiply. And C isn't needed for most of it.

I have some pretty cute sequential dgemm code written mostly in Haskell that's robustly within 5x of OpenBLAS dgemm. And I've not tuned it much, if at all!
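Roughly, the cache-oblivious idea looks like this in C (a sketch, assuming n is a power of two, row-major storage with leading dimension ld, and an illustrative 64-element cutoff):

    // Compute C += A * B on n*n submatrices. Recursively splitting in
    // half means the working set eventually fits in *every* cache level,
    // with no machine-specific block size needed.
    void mm_rec(double *C, const double *A, const double *B,
                int n, int ld) {
        if (n <= 64) {                        // illustrative cutoff
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        int h = n / 2;
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h*ld,   *A22 = A + h*ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h*ld,   *B22 = B + h*ld + h;
        double       *C11 = C,          *C12 = C + h,
                     *C21 = C + h*ld,   *C22 = C + h*ld + h;
        // Standard block decomposition: each quadrant of C gets two
        // recursive multiply-accumulates.
        mm_rec(C11, A11, B11, h, ld); mm_rec(C11, A12, B21, h, ld);
        mm_rec(C12, A11, B12, h, ld); mm_rec(C12, A12, B22, h, ld);
        mm_rec(C21, A21, B11, h, ld); mm_rec(C21, A22, B21, h, ld);
        mm_rec(C22, A21, B12, h, ld); mm_rec(C22, A22, B22, h, ld);
    }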
I remember learning this in one of my courses; it's actually relatively easy to write a cache-aware matrix multiplication. Once you know the size of your cache, multiplying just small sections of the matrix at a time gives a pretty large speedup.
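Concretely, that looks something like this in C (a sketch; the 32 KiB L1 size is an assumption, and n is assumed to be a multiple of the tile size):

    // Choose BLK so three BLK x BLK tiles of doubles fit in L1:
    // 3 * BLK*BLK * 8 bytes <= 32768 gives BLK ~ 36; round down to 32.
    #define BLK 32

    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                // Multiply one tile of A by one tile of B into a tile
                // of C; the working set stays cache-resident throughout.
                for (int i = ii; i < ii + BLK; i++)
                    for (int k = kk; k < kk + BLK; k++)
                        for (int j = jj; j < jj + BLK; j++)
                            C[i][j] += A[i][k] * B[k][j];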
I think this underestimates the idea of transposing one of the matrices. Perhaps it will be discussed in a future article.

    int n = 4096;
    // With Bt = transpose of B, the inner loop walks both A and Bt
    // row-wise, i.e. sequentially in memory.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * Bt[j][k];

Transposing a single matrix is quite fast, and gfortran with -O3 does this trick under the hood, so to see the difference you must use -O0 (or perhaps -O1). (And remember that in Fortran the indices are in the other order.)
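For completeness, the transpose itself is the cheap part; a minimal sketch, assuming the same n x n double arrays as above:

    // O(n^2) work, negligible next to the O(n^3) multiply.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Bt[j][i] = B[i][j];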