If you are interested in specific data on the effect of caching on memory bandwidth, you might want to check out a paper I wrote back in 2011 [1].

Some of the findings:

- Effective bandwidth is approximately 6x greater than main-memory bandwidth for data that fits in L1, 4.3x greater if it fits in L2, and 2.9x greater if it fits in L3.

- Contention for the shared L3 cache can limit the speedup you get from parallelization. For instance, running two threads on a data set that fits in L3 gives a speedup of only 1.75x rather than 2x, and four threads on one four-core processor give a speedup of only about 2x over the single-threaded program.

- It takes relatively few operations per data access for a program to become compute-bound rather than memory-bound. Once 8 or more "add" operations were performed per data access, the effects of caching disappeared almost completely, and execution was limited by the processor rather than by memory. (A rough sketch of this kind of experiment follows the footnote below.)

The specific magnitudes of these results are machine-dependent, but I would expect the general relationships to hold for other machines with a similar cache hierarchy.

[1] http://www.stoneridgetechnology.com/uploads/file/ComputevsMemory.pdf
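For concreteness, here is a minimal C sketch of the kind of measurement described above: sweep the working-set size across the cache levels and vary the number of adds per element to watch the transition from memory-bound to compute-bound. This is not the paper's benchmark; the working-set sizes, repetition counts, and timing approach are my own assumptions chosen to illustrate the idea.

```c
/* Rough sketch (not the paper's actual benchmark). The working-set
 * sizes below are assumptions meant to land in L1, L2, L3, and DRAM
 * on a typical machine; adjust them for your cache hierarchy. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stream over the array `reps` times, doing `adds` extra additions
 * per element loaded, and return the elapsed wall-clock seconds. */
static double sweep(const double *a, size_t n, int adds, int reps)
{
    volatile double sink = 0.0;  /* keeps the compiler from eliding the work */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) {
            double x = a[i];
            for (int k = 0; k < adds; k++)  /* arithmetic per data access */
                x += 1.0;
            acc += x;
        }
        sink += acc;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    /* 16 KiB, 256 KiB, 8 MiB, 256 MiB: assumed to map to L1/L2/L3/DRAM. */
    size_t sizes[] = { 16u << 10, 256u << 10, 8u << 20, 256u << 20 };
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
        size_t n = sizes[s] / sizeof(double);
        double *a = malloc(n * sizeof(double));
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = (double)i;
        /* Scale reps so every configuration touches the same total bytes. */
        int reps = (int)((256u << 20) / sizes[s]);
        for (int adds = 0; adds <= 16; adds += 4) {
            double secs = sweep(a, n, adds, reps);
            double gbs = (double)n * reps * sizeof(double) / secs / 1e9;
            printf("%8zu KiB, %2d adds/elem: %7.2f GB/s effective\n",
                   sizes[s] >> 10, adds, gbs);
        }
        free(a);
    }
    return 0;
}
```

Compiled with something like `cc -O2`, the expected pattern is that at 0 adds per element the effective bandwidth falls sharply as the working set spills out of each cache level, while at high adds-per-element counts the lines converge because the processor, not memory, is the bottleneck. The repeated `x += 1.0` is used because IEEE semantics keep the compiler from folding the loop away without fast-math flags.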