If cache coherence is relevant to you, I strongly recommend the book “A Primer on Memory Consistency and Cache Coherence”. It’s much easier to understand the details of coherence from that kind of broad, systematic perspective than from an incremental read-a-bunch-of-blogs approach.

I found the book very readable, and it cleared up most misconceptions I had. It also teaches a common vocabulary for discussing coherence/consistency, which is useful for conveying the nuances of the topic.

Cache coherence is not super relevant to most programmers, though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent” model. Understanding data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU does (unless you are on something like the DEC Alpha, which you probably aren't).

If you are writing an OS/hypervisor/compiler (or are in any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.
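To make the “data race free -> sequentially consistent” contract concrete, here is a minimal sketch (names and counts are just illustrative): because every access to the shared counter is protected by the same mutex, the program is data-race-free, and the language guarantees it behaves as if all operations happened in one sequentially consistent interleaving, regardless of what the caches are doing.

```c
#include <pthread.h>
#include <stdio.h>

/* Data-race-free: every access to `counter` goes through the same mutex,
 * so the language memory model guarantees sequentially consistent behaviour. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;              /* without the lock this would be a data race */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("%ld\n", counter);   /* always 400000 in the race-free version */
    return 0;
}
```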
A good, introductory, high-level overview of what is going on with cache coherence... albeit specific to x86.

ARM systems are more relaxed, and therefore need more barriers than x86 does. Memory barriers (which also act as "compiler barriers" for the memory/register issue discussed in the article) are handled for you as long as you properly use locks or other synchronization primitives such as semaphores and mutexes.

It's good to know how things work "under the covers", for performance reasons at least, and especially if you ever write a lock-free data structure (where you can't use... well... mutexes or locks), since then you need to place the barriers in the appropriate spots yourself.

------

I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release, ARM and POWER have added acquire/release instructions, and even CUDA is gaining acquire/release semantics.

As high-performance code demands faster and faster systems, our systems will become more and more relaxed. Acquire/release is quickly becoming the standard model.
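As a rough sketch of what acquire/release buys you (hypothetical `payload`/`ready` names, C11 atomics): the release store publishes the payload, and any thread whose acquire load sees the flag is guaranteed to also see the payload, with no full barrier or lock needed. On x86 these compile to plain loads and stores; on ARM/POWER they emit the lighter-weight one-way barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-producer/single-consumer handoff. The release store
 * on `ready` orders the write to `payload` before it; the acquire load
 * orders the read of `payload` after it. */
static int payload;
static atomic_bool ready = false;

void producer(void)
{
    payload = 42;                                   /* plain write            */
    atomic_store_explicit(&ready, true,
                          memory_order_release);    /* publish                */
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                           /* spin until published   */
    return payload;                                 /* guaranteed to see 42   */
}
```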
One thing not mentioned here (nor in previous discussions of the article, it seems) is that DMA is typically not coherent with the CPU caches. This is sort of visible in the little diagram at the top, with the disk sitting on the other side of memory, but it should be spelled out explicitly. If you're using a DMA device (memory<->device or memory<->memory copies), you might end up in a state where the DMA engine and the CPU see different values. This usually means data transfer to/from a disk or GPU, though other peripherals may use DMA too.

Your options here are to manually invalidate/flush your caches and synchronize with the DMA (e.g. via interrupts), or to ask the OS to map the memory region as uncached; in some cases you can get away with a write-through cache policy, if the DMA engine only ever reads the memory.
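For illustration, this is roughly what the "synchronize with the DMA" option looks like with the Linux kernel's streaming DMA API. Just a sketch of a receive path, assuming a driver context; `dev`, `buf`, and `len` are placeholders, and on cache-coherent platforms the sync calls are effectively no-ops.

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Sketch of a non-coherent DMA receive path (driver context assumed). */
static int dma_receive_sketch(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... program the device to write into `handle`, wait for its IRQ ... */

        /* Hand the buffer back to the CPU: on non-coherent hardware this
         * invalidates stale cache lines so the CPU sees the device's data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

        /* ... CPU reads buf ... */

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
}
```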
For those interested, back in 2014 I did a rather simple analysis of CPU caches and for loops to point out some pitfalls:

https://austingwalters.com/the-cache-and-multithreading/

Hope it helps someone. I tend to link it to my co-workers when they ask why I PR'd a re-ordering of loops and functions, or when they ask how I get speedups without changing functionality.
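The classic loop-ordering example in that spirit (sizes are illustrative): both functions compute the same sum, but the row-major traversal walks memory sequentially and uses every byte of each fetched cache line, while the column-major one touches a new line almost every iteration.

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive iterations touch adjacent memory,
 * so each fetched cache line is fully used before being evicted. */
long sum_row_major(int a[N][N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: each access jumps N*sizeof(int) bytes ahead.
 * Same result, typically several times slower for large N. */
long sum_col_major(int a[N][N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```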
I've never heard of the MESI protocol before, so that was really interesting to read, and I liked the comparison to distributed systems.

I'm wondering if the MESI protocol could be used in a networked database? I feel like you'd need a master node to coordinate everything, though (like the L2 does in the example).
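For my own understanding, here's the per-cache-line state machine as I read it, as a toy sketch (heavily simplified: real implementations add transient states and the bus transactions such as read-for-ownership, write-back, and invalidate acknowledgements):

```c
/* The four MESI states a cache line can be in. */
enum mesi_state { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* Events as seen by one cache: its own core's reads/writes, plus
 * snooped requests from other caches on the interconnect. */
enum mesi_event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE };

/* Simplified next-state function for a single line. */
enum mesi_state mesi_next(enum mesi_state s, enum mesi_event e)
{
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? SHARED : s;       /* a miss fetches a copy; could be
                                                     EXCLUSIVE if no one else has it */
    case LOCAL_WRITE:
        return MODIFIED;                          /* must first invalidate other copies */
    case REMOTE_READ:
        return (s == INVALID) ? INVALID : SHARED; /* M/E lines are demoted (M writes back) */
    case REMOTE_WRITE:
        return INVALID;                           /* another core took ownership */
    }
    return s;
}
```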
Something I don't understand is how to deal with cache coherency when you need the same data in a bunch of different configurations.

Take a typical game loop and assume we have a list of Transforms (e.g. world matrix, translation/rotation/scale, whatever - each Transform is a collection of floats in contiguous memory).

Different systems that run in that loop need those transforms in different orders. Rendering may want them organized by material (to avoid shader switching), AI may want them organized by type of state machine, collision by geometric proximity, culling for physics and lighting might be different again, and the list goes on.

The naive answer is "just duplicate the transforms when they are updated", but that introduces its own complexity and comes at its own cost of fetching data from the cache.

I guess what I'm getting at is:

1) I would love to learn more about how this problem is tackled by robust game engines (I guess games too - but games have more specific knowledge than engines and can have unique code to handle it).

2) Does it all just come out in the wash at the end of the day? Not saying just throw it all on the heap and don't care... but maybe optimizing for one path, like "updating world transforms for rendering", is worth it, and whatever the cost is to jump around elsewhere doesn't really matter?

Sorry if my question is a bit vague... any insight is appreciated.
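One compromise that comes up a lot (just a sketch, not how any particular engine does it; all the type names here are made up): keep a single contiguous pool of transforms, and give each system its own sorted list of indices into it, so the per-system iteration order is whatever that system wants while the transform data itself is never duplicated.

```c
#include <stddef.h>

/* One contiguous pool of transforms (layout illustrative). */
typedef struct { float pos[3]; float rot[4]; float scale[3]; } Transform;

typedef struct {
    Transform *transforms;   /* single source of truth, contiguous */
    size_t     count;
} TransformPool;

/* Each system keeps its own ordering as indices into the shared pool:
 * rendering sorts by material, collision by spatial cell, and so on.
 * Only the small index arrays are duplicated, not the transform data. */
typedef struct {
    const size_t *order;     /* indices in this system's preferred order */
    size_t        count;
} SystemView;

void update_system(const TransformPool *pool, const SystemView *view,
                   void (*visit)(const Transform *))
{
    for (size_t i = 0; i < view->count; i++)
        visit(&pool->transforms[view->order[i]]);  /* gather through the index */
}
```

The gather through `order` still isn't free, which is why some engines go further and copy out just the fields a given system actually needs - which is basically the "duplicate on update" trade-off again, just per field instead of per Transform.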
This might be a naive question: how did we decide as an industry that cache should be controlled by the hardware, but registers and main memory by the compiler?
Anyone else start thinking of Rust's mutable/immutable borrow system when reading about the MESI protocol? It's not quite the same - with Rust, mutable borrows are never shared, and you can never mutate through a shared read-only borrow - but the principle seems like a simplification of the full MESI protocol.

It seems like this would be generally applicable to a wide variety of distributed and concurrent applications.
AMD's Rome processors are an interesting case, with 16MB of L3 cache per four-core CCX. So the 64 core processor has 256MB of L3 cache. It wasn't that long ago that 256MB was a respectable amount of DRAM in a big server. An early-90s, fridge-sized Sun 690MP maxed out at 1GB of DRAM and had 1MB of L2 cache, no L3.
> “different cores can have different/stale values in their individual caches”.

Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.

And what about virtual caches? Non-coherent cache lines?

Moreover, even in the face of cache coherency you can still have race conditions.
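A minimal illustration of that last point: even on fully coherent hardware, a plain read-modify-write races, because coherence only keeps individual cache lines consistent; it does not make `counter++` atomic.

```c
#include <pthread.h>
#include <stdio.h>

/* Classic lost-update race: the caches stay perfectly coherent, but
 * counter++ is a load, an add, and a store, and two cores can interleave
 * them so that increments are lost. Coherence is not atomicity. */
static long counter = 0;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;              /* data race: final value is typically < 2000000 */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);
    return 0;
}
```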
The one-line summary seems to be that one should never worry about the caches themselves introducing concurrency bugs.

I mean, once we account for memory-operation reordering on each core, modeling each memory address as storing a single value that is visible to all cores is correct from a concurrency-correctness point of view, right?
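The standard "store buffering" litmus test is what makes the "account for reordering" part concrete (sketch in C11 atomics): each location individually behaves like a single coherent value, yet with relaxed ordering - or, on real x86 hardware, with plain stores and loads, thanks to store buffers - both threads can read 0, an outcome sequential consistency forbids.

```c
#include <stdatomic.h>

/* Store-buffering litmus test. Under sequential consistency at least one
 * thread must observe the other's store, so r0 == 0 && r1 == 0 is forbidden.
 * With relaxed ordering, both loads may return 0. */
static atomic_int x = 0, y = 0;
static int r0, r1;

void thread0(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
}

void thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
}
```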
It's worth pointing out that the L1 cache and its associated logic is the only* way a core talks to the outside world, including all I/O. With that in mind, it is easy to understand why it is so crucial to performance.

* there might be some minor exceptions
The biggest myth is that any of this matters to anybody but a tiny fraction of niche programmers.

The reason some "app" is slow is not cache coherence traffic. It's that somebody chose the wrong data structure, created some stupid architecture, or wrote incomprehensible code that the next guy extended in a horribly inefficient way because they didn't understand the original. My web browsing experience is slow because people include heaps of bloated JS crap and trackers and ad networks, so I have to load 15 megabytes of nonsense and wait for JS to render stuff I don't want to see. None of this would be any better if anybody involved understood CPU caches better.

Even in the kernel or in HPC applications, most code is not on the hot path. Programmers should rather focus on clean architecture and understandable code. How does it help if you hand-optimize something that nobody understands, nobody can fix a bug in, and nobody can extend? That's horrible code, even if it's 5% faster than what somebody else wrote.

TL;DR: This is interesting, but likely totally irrelevant to your day job. In the list of things to do to improve your code, it comes so far down that you'll never get there.
Note that this is quite specific to x86; on other architectures like POWER there are much weaker guarantees, which will lead to problems if you assume the same model.