Speed memory access by arranging data to take advantage of CPU caching

149 points by jervisfm · about 11 years ago

15 comments

modeless · about 11 years ago
I've recently been using http://halide-lang.org/, a language designed to separate the specification of an algorithm from the order of the computations and the layout of the data in memory. It allows you to first get an algorithm working and then quickly try a huge number of permutations of the options for data layout, parallelism, and vectorization. You can get incredible speedups vs. a naive series of C for loops.
Comment #7543673 not loaded
Comment #7543548 not loaded
Comment #7547144 not loaded
Comment #7543909 not loaded
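A minimal sketch of the algorithm/schedule separation modeless describes, loosely following Halide's public tutorials. The pipeline (a simple gradient) and the tile/vector sizes are illustrative only, and details such as the argument form of realize vary between Halide versions:

```cpp
#include "Halide.h"
#include <cstdio>

int main() {
    Halide::Func gradient;
    Halide::Var x, y, xi, yi;

    // The algorithm: *what* to compute.
    gradient(x, y) = x + y;

    // The schedule: *how* to compute it. Swapping this one chain of calls
    // for another reorders the loops and the memory traffic without
    // touching the algorithm above.
    gradient.tile(x, y, xi, yi, 64, 64)   // compute in 64x64 tiles
            .vectorize(xi, 8)             // 8-wide SIMD along the tile row
            .parallel(y);                 // tiles run on multiple cores

    Halide::Buffer<int32_t> out = gradient.realize({2048, 2048});
    std::printf("out(10, 10) = %d\n", (int)out(10, 10));
    return 0;
}
```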
alexhutcheson · about 11 years ago
If you are interested in seeing specific data on the effect caching has on memory bandwidth, you might want to check out this paper I wrote back in 2011 [1].

Some of my findings:

- Effective bandwidth is approximately 6x greater for data that fits in L1, 4.3x greater if it fits in L2, and 2.9x greater if it fits in L3.

- Contention for shared L3 cache can limit the speedup you get from parallelization. For instance, running two threads on a data set that fits in L3 results in a speedup of only 1.75x, rather than 2x. Four threads on one four-core processor results in a speedup of only 2x vs. the single-threaded program.

- It takes relatively few operations for programs to become compute-bound rather than memory-bound. If 8 or more "add" operations were performed per data access, we found that the effects of caching disappeared almost completely, and the program's execution was limited by the processor rather than by the memory bottleneck.

The specific magnitude of these results is machine-dependent, but I would expect the general relationships to hold for other machines with a similar cache hierarchy.

[1] http://www.stoneridgetechnology.com/uploads/file/ComputevsMemory.pdf
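A rough, self-contained sketch of the kind of measurement behind numbers like these: stream over a working set of increasing size, doing one add per element, and watch effective bandwidth drop as the data spills out of L1, L2, and L3. The sizes, pass count, and output format are arbitrary choices, not taken from the paper:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Working sets from 16 KiB (fits in L1) up to 64 MiB (well past most L3s).
    for (std::size_t bytes = 16u << 10; bytes <= (64u << 20); bytes <<= 2) {
        std::vector<float> data(bytes / sizeof(float), 1.0f);
        volatile float sink = 0.0f;          // keep the compiler from deleting the loop

        const int passes = 100;
        auto t0 = std::chrono::steady_clock::now();
        for (int p = 0; p < passes; ++p) {
            float sum = 0.0f;
            for (float v : data) sum += v;   // one add per load: memory-bound
            sink = sink + sum;
        }
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%8zu KiB : %6.1f GB/s effective\n",
                    bytes >> 10, passes * (double)bytes / secs / 1e9);
    }
    return 0;
}
```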
Tiktaalik · about 11 years ago
The first I heard of this sort of thing was DICE's presentation on "Data Oriented Design": http://dice.se/publications/introduction-to-data-oriented-design/

I think they'd differ from the author here in that this isn't an optimization to reach for when performance starts to suffer, but a design you have to adopt from the beginning, because refactoring everything to this style later is a big change.
Comment #7547694 not loaded
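For concreteness, a hedged sketch of the kind of refactor data-oriented design asks for, and why it is painful to retrofit. The field names and entity layout here are invented, not taken from the DICE slides:

```cpp
#include <cstddef>
#include <vector>

// "Array of structs": each entity's fields sit together, so a pass that only
// updates positions still drags velocities, health, and names through the cache.
struct EntityAoS {
    float x, y, z;
    float vx, vy, vz;
    int   health;
    char  name[32];
};

void update_positions_aos(std::vector<EntityAoS>& es, float dt) {
    for (auto& e : es) {
        e.x += e.vx * dt;
        e.y += e.vy * dt;
        e.z += e.vz * dt;
    }
}

// "Struct of arrays": fields that are processed together are stored together,
// so the same pass streams through dense, prefetch-friendly arrays. Every
// piece of code that touched EntityAoS has to change, which is why this is
// hard to bolt on late in a project.
struct EntitiesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<int>   health;
};

void update_positions_soa(EntitiesSoA& es, float dt) {
    for (std::size_t i = 0; i < es.x.size(); ++i) {
        es.x[i] += es.vx[i] * dt;
        es.y[i] += es.vy[i] * dt;
        es.z[i] += es.vz[i] * dt;
    }
}
```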
assholesRppl2 · about 11 years ago
The best cache-aware programming lesson I ever received was in the CS61C course at Berkeley -- building a cache-blocking algorithm to run a matrix multiplication function using the cache as efficiently as possible. We unrolled loops so that the size of each iteration was exactly the size of one cache block, and instantly saw the increase in FLOPS.

Then we did some OpenMP parallelization. That was cool.

Nice post!
Comment #7544517 not loaded
Comment #7543885 not loaded
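A hedged sketch of the cache-blocking exercise described above (not the actual CS61C assignment): multiply N x N row-major matrices in B x B tiles so each tile triple stays resident in cache, with an optional OpenMP pragma for the parallelization step.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t N = 1024;   // matrix dimension
constexpr std::size_t B = 64;     // tile size: tune so ~3*B*B*sizeof(double) fits in cache

// C += A * B for row-major N x N matrices. The three outer loops walk tiles;
// the three inner loops do the multiply within a tile, reusing data while it
// is still in cache. The pragma is ignored unless built with -fopenmp.
void matmul_blocked(const double* A, const double* Bmat, double* C) {
    #pragma omp parallel for collapse(2)
    for (std::size_t ii = 0; ii < N; ii += B)
        for (std::size_t jj = 0; jj < N; jj += B)
            for (std::size_t kk = 0; kk < N; kk += B)
                for (std::size_t i = ii; i < std::min(ii + B, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + B, N); ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + B, N); ++j)
                            C[i * N + j] += a * Bmat[k * N + j];
                    }
}
```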
sriram_malhar · about 11 years ago
Martin Thompson has a blog and a series of talks on "Mechanical Sympathy" -- being in tune with the machine to extract performance, like a race car driver.

"How to get 100K TPS with 1ms latency": http://www.infoq.com/presentations/LMAX

Blog: http://mechanical-sympathy.blogspot.in
hadoukenio · about 11 years ago
If you're interested in this type of thing, I highly recommend "Code Optimization: Effective Memory Usage" by Kris Kaspersky:

http://www.amazon.com/Code-Optimization-Effective-Memory-Usage/dp/1931769249/
rdtsc · about 11 years ago
This is a very good article on the topic.

I hope universities start to catch up and build this, along with distributed systems and networking, into their core curriculum.

Mix multi-threaded programming with cache effects and there is a whole new world of side effects and unexpected results. There is a side note in the article about it, but I think that deserves a whole article of its own!

It is interesting that this plays somewhat in favor of functional languages. They are often more data-centric than code-centric: you lay out your data and then pass it to your functions, as opposed to thinking about the layout of the functionality as different objects that just happen to encapsulate data (which separates and breaks up the data).
Comment #7545480 not loaded
mortenlarsen · about 11 years ago
For those on GNU/Linux who want to know about the CPU caches on their machine, "lscpu" gives more detailed cache info than /proc/cpuinfo.

I got curious about how lscpu determines the L1 data and L1 instruction caches, so I had a look at its source code and found that it reads /sys/devices/system/cpu/cpu*/cache/ and /sys/devices/system/cpu/cpu*/topology/.

This led me to "Memory part 4: NUMA support": http://lwn.net/Articles/254445/

Pretty interesting stuff.
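On glibc systems some of the same geometry is also available programmatically through sysconf(3). This is a glibc extension rather than POSIX, so these _SC_ names may not exist on other platforms, and a result of 0 or -1 means the value was not reported:

```cpp
#include <cstdio>
#include <unistd.h>

int main() {
    std::printf("L1d cache size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L1d line size  : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    std::printf("L2 cache size  : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3 cache size  : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```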
nawb · about 11 years ago
So this is all cool, and it's incredibly nifty to have the hardware insight while writing software. But these optimizations (loop tiling, loop fusion, etc.) could be (and, to my knowledge, ARE) part of basic gcc and Java compilers. Why are they not more commonly used? Why do we have to specifically provide a flag to say "Hey, by the way, I almost forgot: make my program 50 times faster"?

I'm slightly in the dark as to why loop optimizations are not part of the default compile process.
Comment #7544756 not loaded
Comment #7544419 not loaded
Comment #7545636 not loaded
jaredlwong · about 11 years ago
Alignment is another key point the author forgot to mention. Cache-aligning data is seriously important, even if it means padding stuff like crazy. Caches are pretty big, but having to load two different cache lines and do some bit trickery is a real killer.

Also, if you really want performance, go parallel. Cilk is amazing, and there are forks of both gcc and clang with Cilk. Cilk is seriously awesome.
Comment #7545120 not loaded
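A small sketch of the padding/alignment point, using an invented record type: a 48-byte record packed back to back will straddle 64-byte cache-line boundaries about half the time, so touching one record may cost two line fills; padding it out to the line size trades memory for the guarantee that each record lives in exactly one line.

```cpp
#include <cstddef>

// Twelve floats = 48 bytes. Stored back to back, roughly half the records
// straddle a 64-byte cache-line boundary.
struct Particle {
    float x, y, z, vx, vy, vz;
    float mass, charge, radius, age, spin, heat;
};

// alignas(64) also rounds the size up to 64, so an array of these puts
// exactly one record in each cache line, at the cost of 16 padding bytes each.
struct alignas(64) AlignedParticle {
    Particle p;
};

static_assert(sizeof(Particle) == 48, "expected a 48-byte record");
static_assert(sizeof(AlignedParticle) == 64, "one cache line per record");
```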
natebrennand · about 11 years ago
There are some interesting implications for databases, in terms of column stores vs. row stores, due to CPU caching. Besides the I/O improvements, column stores benefit greatly from better CPU usage because the data is iterated over by type (column by column), so the same operators stay in the cache. This allows data to be pipelined into the CPU and processed very quickly relative to traditional row stores. It can even be efficient to operate on compressed data [1] if the relevant decoding data can be kept in the cache.

In case you missed it, there was a relevant article last week about how B-trees benefit from reducing the number of cache requests [2].

[1] http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf

[2] http://www.me.net.nz/blog/btrees-are-the-new-black/
Comment #7544714 not loaded
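As a toy illustration of the cache side of that argument (the table layout here is made up): summing one column of a row store drags every other column through the cache, while the same scan over a column store touches nothing but the values it needs.

```cpp
#include <cstdint>
#include <vector>

// Row store: every row carries all columns, so a scan of price_cents also
// loads id and description (~8 useful bytes per 64 fetched).
struct Row {
    std::int64_t id;
    std::int64_t price_cents;
    char         description[48];
};

std::int64_t sum_prices_rowstore(const std::vector<Row>& rows) {
    std::int64_t total = 0;
    for (const auto& r : rows) total += r.price_cents;
    return total;
}

// Column store: the column is a dense array, so every cache line fetched is
// entirely price data and the hardware prefetcher can stream it in.
std::int64_t sum_prices_colstore(const std::vector<std::int64_t>& price_cents) {
    std::int64_t total = 0;
    for (std::int64_t p : price_cents) total += p;
    return total;
}
```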
deletes · about 11 years ago
I was curious about this, as I have never done any such testing, so I wrote a simple C program.

The object was a struct with the size of a typical game object (100 B). I then trashed the memory by doing a lot of malloc/free calls of the object's size. The two tests were: an array of objects (one contiguous array), and an array of pointers to objects that were each allocated separately. Then I iterated over the objects doing some very simple operation (just enough to access each one, identical for every object) and timed it, trashing the memory some more between runs.

The time-taken ratio is:

array of objects : array of pointers to objects = 1 : 1.189

The second test is identical, except that I made some extra effort to ensure that every object behind the array of pointers was not adjacent in memory to any previous one:

array of objects : array of pointers to objects = 1 : 2.483

The difference got much smaller once the operation on each object became more time-consuming.
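A compact sketch of roughly that experiment; the object size, element count, and the pointer shuffle used to stand in for heap fragmentation are all arbitrary choices:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

struct GameObject { char payload[100]; };   // ~100-byte object, as in the comment

int main() {
    const std::size_t n = 1'000'000;

    // Test 1: contiguous array of objects -- sequential, prefetch-friendly.
    std::vector<GameObject> dense(n);

    // Test 2: array of pointers to separately allocated objects. Shuffling the
    // pointers approximates a fragmented heap where array neighbours are far
    // apart in memory.
    std::vector<GameObject*> sparse(n);
    for (auto& p : sparse) p = new GameObject();
    std::shuffle(sparse.begin(), sparse.end(), std::mt19937{42});

    auto touch = [](GameObject& o) { o.payload[0]++; };   // minimal per-object work

    auto t0 = std::chrono::steady_clock::now();
    for (auto& o : dense) touch(o);
    auto t1 = std::chrono::steady_clock::now();
    for (auto* p : sparse) touch(*p);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("array of objects : %.2f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    std::printf("array of pointers: %.2f ms\n",
                std::chrono::duration<double, std::milli>(t2 - t1).count());

    for (auto* p : sparse) delete p;
    return 0;
}
```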
baruch · about 11 years ago
There is also the topic of cache-oblivious data structures. Sadly, I'm still trying to figure that one out, but it looks like it could help, both with the memory-to-cache access-time disparity and, it is claimed, with the memory-to-disk disparity as well.

If someone knows of an easier-to-digest explanation than the proof-laden academic articles, I'd appreciate a pointer :-)
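One way to get a feel for the idea without the proofs: a cache-oblivious out-of-place matrix transpose. It recursively halves the larger dimension until the sub-problem is tiny, so at some level of the recursion the working set fits in each cache (and, for huge matrices, in RAM vs. disk), yet no cache size appears anywhere in the code. This is a generic textbook-style sketch, not taken from any particular paper:

```cpp
#include <cstddef>

// Transpose a rows x cols block of src (row-major, leading dimension src_stride)
// into dst (cols x rows, leading dimension dst_stride).
void transpose(const double* src, double* dst,
               std::size_t rows, std::size_t cols,
               std::size_t src_stride, std::size_t dst_stride) {
    if (rows <= 16 && cols <= 16) {
        // Base case: a small tile, transposed directly.
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                dst[j * dst_stride + i] = src[i * src_stride + j];
    } else if (rows >= cols) {
        // Split the taller dimension in half and recurse on each part.
        std::size_t h = rows / 2;
        transpose(src, dst, h, cols, src_stride, dst_stride);
        transpose(src + h * src_stride, dst + h, rows - h, cols, src_stride, dst_stride);
    } else {
        // Split the wider dimension in half and recurse on each part.
        std::size_t w = cols / 2;
        transpose(src, dst, rows, w, src_stride, dst_stride);
        transpose(src + w, dst + w * dst_stride, rows, cols - w, src_stride, dst_stride);
    }
}
```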
agumonkey · about 11 years ago
A talk by Herb Sutter about caches, the RAM prefetcher, and locality/contiguous allocation: http://channel9.msdn.com/Events/Build/2014/2-661

Very simple and informative.
frozenport · about 11 years ago
The author should mention growing cache sizes.
Comment #7543738 not loaded