Cache-aware matrix multiplication – naive isn't that bad

58 points by m_class almost 12 years ago

8 comments

stephencanon almost 12 years ago
For anyone who wants to learn about this in depth (it becomes a surprisingly interesting subject once you push past the most basic blocking and vectorization techniques), the canonical reference on memory access patterns for truly high-performance matrix multiplication is "Anatomy of High-Performance Matrix Multiplication" by Goto and van de Geijn; it's also a very readable paper.
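A minimal sketch of the loop blocking the Goto paper builds on (not the paper's actual kernel; N, BS, and the row-major layout are assumptions, and a real BLAS would tune BS to the cache hierarchy):

    #include <stddef.h>

    #define N  1024   /* matrix dimension (assumed) */
    #define BS 64     /* block size; must divide N, tuned to cache (assumed) */

    /* C += A * B for N x N row-major matrices. Working on BS x BS
       tiles keeps each tile of A, B, and C resident in cache while
       it is being reused, instead of streaming the whole matrices
       through on every pass. */
    void matmul_blocked(const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < N; ii += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t jj = 0; jj < N; jj += BS)
                    for (size_t i = ii; i < ii + BS; i++)
                        for (size_t k = kk; k < kk + BS; k++) {
                            double a = A[i * N + k];
                            for (size_t j = jj; j < jj + BS; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }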
dmlorenzetti almost 12 years ago
Probably anybody who actually implements matrix multiplication knows about ATLAS already, but this software does what the article advocates, and then some, tuning itself to your specific machine's cache: http://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software
betterunix almost 12 years ago
Naive is going to be "that bad" at that size. I tried this on my own system: 4096x4096 matrix multiplication with the naive algorithm, the naive algorithm with swapped loops, and Strassen's algorithm. I had my implementation of Strassen's algorithm switch to the naive algorithm (with bad locality of reference) for 64x64 matrices. The results were 530s for the textbook algorithm, 140s for the textbook algorithm with better cache access, and 94s for Strassen's algorithm.
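The "swapped loops" variant mentioned here is presumably the classic i-k-j reordering; a sketch under that assumption (static 4096x4096 arrays, C zero-initialized by the caller):

    #define N 4096

    /* Same arithmetic as the textbook i-j-k loop, but with j and k
       swapped: the innermost loop now reads B and writes C along rows,
       so successive iterations hit successive cache lines instead of
       striding N doubles down a column of B on every step. */
    void matmul_ikj(double (*A)[N], double (*B)[N], double (*C)[N])
    {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double a = A[i][k];
                for (int j = 0; j < N; j++)
                    C[i][j] += a * B[k][j];
            }
    }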
6cxs2hd6 almost 12 years ago
As a general point, it's easy to forget all the layers.

If you learn C, you feel close to the metal. I have an array. It is contiguous bytes in RAM. I've got the power!

Well, maybe.

The array might be contiguous virtual memory, but backed by disjoint blocks of physical memory.

Then there's cache. In front of ... another cache.

At least with C you're closER to the metal.
moomin almost 12 years ago
Just to point out the obvious: the thing to take away from this article isn't that this is the right way to do matrix multiplication (ATLAS and BLAS are), but that cache considerations can beat even complexity concerns. Bjarne Stroustrup is quite fond of demonstrating the superiority of vector even when complexity analysis would suggest list was the right approach.
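Stroustrup's vector-vs-list demo inserts each of a stream of values at its sorted position in both containers; the contiguous version wins despite the O(n) shift. A rough C rendering of the vector side (the harness and sizes are assumptions):

    #include <stddef.h>
    #include <string.h>

    /* Insert v into the sorted prefix a[0..*len); the caller must
       guarantee capacity for one more element (assumed). The memmove
       is O(n), but it streams through contiguous cache lines, which
       is why this beats a pointer-chasing linked list in practice. */
    void sorted_insert(int *a, size_t *len, int v)
    {
        size_t i = 0;
        while (i < *len && a[i] < v)
            i++;                     /* linear search, also cache-friendly */
        memmove(&a[i + 1], &a[i], (*len - i) * sizeof a[0]);
        a[i] = v;
        (*len)++;
    }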
carterschonwald almost 12 years ago
Doing cache-oblivious blocking can make a huge 1000x perf difference over naive matrix multiply. And C isn't needed for most of it.

I've some pretty cute sequential dgemm code written in mostly Haskell that's robustly within 5x of OpenBLAS dgemm. And I've not tuned it much / at all!
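The Haskell code isn't shown, but cache-oblivious blocking itself is language-agnostic: recursively split into quadrants until the subproblem fits in whatever cache exists. A C sketch (CUTOFF and power-of-two n are assumptions):

    #include <stddef.h>

    #define CUTOFF 64   /* base-case size; assumed, not from the comment */

    /* C += A * B on n x n submatrices of a row-major matrix with
       leading dimension ld. At some recursion depth the working set
       fits in each cache level -- no cache size ever appears. */
    void matmul_co(const double *A, const double *B, double *C,
                   size_t n, size_t ld)
    {
        if (n <= CUTOFF) {                    /* small: plain triple loop */
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < n; k++)
                    for (size_t j = 0; j < n; j++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        size_t h = n / 2;                     /* split into quadrants */
        #define Q(M, r, c) (M + (r) * h * ld + (c) * h)
        matmul_co(Q(A,0,0), Q(B,0,0), Q(C,0,0), h, ld);
        matmul_co(Q(A,0,1), Q(B,1,0), Q(C,0,0), h, ld);
        matmul_co(Q(A,0,0), Q(B,0,1), Q(C,0,1), h, ld);
        matmul_co(Q(A,0,1), Q(B,1,1), Q(C,0,1), h, ld);
        matmul_co(Q(A,1,0), Q(B,0,0), Q(C,1,0), h, ld);
        matmul_co(Q(A,1,1), Q(B,1,0), Q(C,1,0), h, ld);
        matmul_co(Q(A,1,0), Q(B,0,1), Q(C,1,1), h, ld);
        matmul_co(Q(A,1,1), Q(B,1,1), Q(C,1,1), h, ld);
        #undef Q
    }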
lettergram almost 12 years ago
I remember learning this in one of my courses; it's actually relatively easy to create a program for this kind of matrix multiplication. Once you know the size of your cache, multiplying just small, cache-sized sections of the matrix gives a pretty large speed increase.
gus_massa almost 12 years ago
I think this underestimates the idea of transposing one of the matrices. Perhaps it will be discussed in a future article.

    n = 4096;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++)
                C[i][j] += A[i][k] * Bt[j][k];
        }
    }

Transposing a single matrix is quite fast, and gfortran with -O3 does this trick under the hood, so to see the difference you must use -O0 (or perhaps -O1). (And remember that in Fortran the indices are in the other order.)
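For completeness, the transposition step the snippet above assumes, written in the same fragment style (Bt is a separate n x n array):

    /* One pass over B: O(n^2) work, negligible next to the O(n^3)
       multiply. Afterwards the inner k-loop reads both A[i][k] and
       Bt[j][k] sequentially, with unit stride. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            Bt[j][i] = B[i][j];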