Here are a couple of observations:

1) The implementation may not be the most efficient for a larger or a denser matrix. 1% of 52,000 is 520 non-zeros per row; spread across 32 threads, that is 16-17 additions per thread. As that number grows, increasing the number of threads per row (and eventually using more than one block per row) would be a good idea.

2) He is allocating twice as much shared memory as required. I genuinely hope that was an artefact from an earlier version. If not, that is a killer for performance: using more shared memory per block reduces the number of blocks that can be resident concurrently.

Note: Not sure why he is still using CUDA 3.2. CUDA has had CSR multiplication (in cuSPARSE) for a few months now, and it has even gone through a revision to make it even faster.
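For reference, the classic warp-per-row CSR kernel from that era looks roughly like the sketch below. This is my own illustration, not his code (names and CSR layout are assumptions); the point relevant to observation 2 is that the shared-memory allocation is exactly one float per thread, no more:

```cuda
#define WARP_SIZE 32

// Sketch: y[row] = dot(A[row,:], x) for a CSR matrix (row_ptr/col_idx/val).
// One warp cooperates on one row. Launch with a dynamic shared-memory size
// of exactly blockDim.x * sizeof(float) -- one float per thread. Doubling
// that (as in the code under discussion) buys nothing and hurts occupancy.
__global__ void spmv_csr_warp(const int   *row_ptr,
                              const int   *col_idx,
                              const float *val,
                              const float *x,
                              float       *y,
                              int          num_rows)
{
    extern __shared__ float partial[];

    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int lane    = threadIdx.x & (WARP_SIZE - 1);  // index within the warp
    int warp_id = tid / WARP_SIZE;                // one warp per row

    if (warp_id >= num_rows)
        return;                                   // whole warp exits together

    // Each thread strides across the row's non-zeros:
    // with ~520 non-zeros per row, that is ~16 multiply-adds per thread.
    float sum = 0.0f;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += WARP_SIZE)
        sum += val[j] * x[col_idx[j]];

    // Warp-synchronous tree reduction in shared memory (pre-__shfl hardware);
    // volatile prevents the compiler from caching the partial sums in registers.
    volatile float *vp = partial;
    vp[threadIdx.x] = sum;
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        if (lane < offset)
            vp[threadIdx.x] += vp[threadIdx.x + offset];
    }

    if (lane == 0)
        y[warp_id] = vp[threadIdx.x];
}
```

Once the per-thread work climbs well past this, the better designs assign multiple warps (or whole blocks) per row and add a second reduction stage across warps.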
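On the library point: the legacy cuSPARSE CSR routine makes this a one-liner. Sketching from memory of the old `cusparseScsrmv` interface (the exact parameter list shifted between early releases, e.g. whether alpha/beta were taken by value or by pointer, so treat this as approximate):

```cuda
#include <cusparse_v2.h>

// Hedged sketch: y = alpha * A * x + beta * y for a CSR matrix already on
// the device. Names (d_val, d_row_ptr, ...) are illustrative; error
// checking omitted for brevity.
void csr_spmv(int m, int n, int nnz,
              const float *d_val, const int *d_row_ptr, const int *d_col_idx,
              const float *d_x, float *d_y)
{
    cusparseHandle_t   handle;
    cusparseMatDescr_t descr;
    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    const float alpha = 1.0f, beta = 0.0f;
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_row_ptr, d_col_idx,
                   d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
}
```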