If cache coherence is relevant to you, I strongly recommend the book “A Primer on Memory Consistency and Cache Coherence”. It’s much easier to understand the details of coherence from that kind of broad, systematic perspective than from an incremental read-a-bunch-of-blogs approach.

I found the book very readable, and it cleared up most misconceptions I had. It also teaches a common vocabulary for discussing coherence/consistency, which is useful for conveying the nuances of the topic.

Cache coherence is not super relevant to most programmers, though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent” model. Understanding data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU does (unless you are on something like the DEC Alpha, which you probably aren't).

If you are writing an OS/hypervisor/compiler (or are in any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.
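To make the “data race free -> sequentially consistent” contract concrete, here is a minimal sketch (names and counts are just illustrative): because every access to the shared counter is protected by the same mutex, the program is data-race-free, and the language guarantees it behaves as if all operations happened in one sequentially consistent interleaving, regardless of what the caches are doing.

```c
#include <pthread.h>
#include <stdio.h>

/* Data-race-free: every access to `counter` goes through the same mutex,
 * so the language memory model guarantees sequentially consistent behaviour. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;              /* without the lock this would be a data race */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("%ld\n", counter);   /* always 400000 in the race-free version */
    return 0;
}
```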
A good, introductory, high-level overview of what is going on with cache coherence... albeit specific to x86.

ARM systems are more relaxed, and therefore need more barriers than x86 does. Memory barriers (which also act as "compiler barriers" for the memory/register issue discussed in the article) are handled for you as long as you properly use locks or other synchronization primitives such as semaphores and mutexes.

It's good to know how things work "under the covers", for performance reasons at least, and especially if you ever write a lock-free data structure (where you can't use... well... mutexes or locks), since then you need to place the barriers in the appropriate spots yourself.

------

I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release, ARM and POWER have added acquire/release instructions, and even CUDA is gaining acquire/release semantics.

As high-performance code demands faster and faster systems, our systems will become more and more relaxed. Acquire/release is quickly becoming the standard model.
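As a rough sketch of what acquire/release buys you (hypothetical `payload`/`ready` names, C11 atomics): the release store publishes the payload, and any thread whose acquire load sees the flag is guaranteed to also see the payload, with no full barrier or lock needed. On x86 these compile to plain loads and stores; on ARM/POWER they emit the lighter-weight one-way barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-producer/single-consumer handoff. The release store
 * on `ready` orders the write to `payload` before it; the acquire load
 * orders the read of `payload` after it. */
static int payload;
static atomic_bool ready = false;

void producer(void)
{
    payload = 42;                                   /* plain write            */
    atomic_store_explicit(&ready, true,
                          memory_order_release);    /* publish                */
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                           /* spin until published   */
    return payload;                                 /* guaranteed to see 42   */
}
```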
One thing not mentioned here (nor in previous discussions of the article, it seems) is that DMA is typically not coherent with the CPU caches. This is sort of visible in the little diagram at the top, with the disk sitting on the other side of memory, but it should be spelled out explicitly. If you're using a DMA device (memory<->device or memory<->memory copies), you might end up in a state where the DMA engine and the CPU see different values. This usually means data transfer to/from a disk or GPU, though other peripherals may use DMA too.

Your options here are to manually invalidate/flush your caches and synchronize with the DMA (e.g. via interrupts), or to ask the OS to map the memory region as uncached; in some cases you can get away with a write-through cache policy, if the DMA engine only ever reads the memory.
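For illustration, this is roughly what the "synchronize with the DMA" option looks like with the Linux kernel's streaming DMA API. Just a sketch of a receive path, assuming a driver context; `dev`, `buf`, and `len` are placeholders, and on cache-coherent platforms the sync calls are effectively no-ops.

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Sketch of a non-coherent DMA receive path (driver context assumed). */
static int dma_receive_sketch(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... program the device to write into `handle`, wait for its IRQ ... */

        /* Hand the buffer back to the CPU: on non-coherent hardware this
         * invalidates stale cache lines so the CPU sees the device's data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

        /* ... CPU reads buf ... */

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
}
```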
For those interested, back in 2014 I did a rather simple analysis of CPU caches and for loops to point out some pitfalls:

https://austingwalters.com/the-cache-and-multithreading/

Hope it helps someone. I tend to link it to my co-workers when they ask why I PR'd a re-ordering of loops and functions, or when they ask how I get speedups without changing functionality.
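The classic loop-ordering example in that spirit (sizes are illustrative): both functions compute the same sum, but the row-major traversal walks memory sequentially and uses every byte of each fetched cache line, while the column-major one touches a new line almost every iteration.

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive iterations touch adjacent memory,
 * so each fetched cache line is fully used before being evicted. */
long sum_row_major(int a[N][N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: each access jumps N*sizeof(int) bytes ahead.
 * Same result, typically several times slower for large N. */
long sum_col_major(int a[N][N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```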
I've never heard of the MESI protocol before, so that was really interesting to read, and I liked the comparison to distributed systems.

I'm wondering if the MESI protocol could be used in a networked database? I feel like you'd need a master node to coordinate everything, though (like the L2 does in the example).
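For my own understanding, here's the per-cache-line state machine as I read it, as a toy sketch (heavily simplified: real implementations add transient states and the bus transactions such as read-for-ownership, write-back, and invalidate acknowledgements):

```c
/* The four MESI states a cache line can be in. */
enum mesi_state { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* Events as seen by one cache: its own core's reads/writes, plus
 * snooped requests from other caches on the interconnect. */
enum mesi_event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE };

/* Simplified next-state function for a single line. */
enum mesi_state mesi_next(enum mesi_state s, enum mesi_event e)
{
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? SHARED : s;       /* a miss fetches a copy; could be
                                                     EXCLUSIVE if no one else has it */
    case LOCAL_WRITE:
        return MODIFIED;                          /* must first invalidate other copies */
    case REMOTE_READ:
        return (s == INVALID) ? INVALID : SHARED; /* M/E lines are demoted (M writes back) */
    case REMOTE_WRITE:
        return INVALID;                           /* another core took ownership */
    }
    return s;
}
```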
Something I don't understand is how to deal with cache coherency when you need the same data in a bunch of different configurations.

Take a typical game loop and assume we have a list of Transforms (e.g. world matrix, translation/rotation/scale, whatever - each Transform is a collection of floats in contiguous memory).

Different systems that run in that loop need those transforms in different orders. Rendering may want them organized by material (to avoid shader switching), AI may want them organized by type of state machine, collision by geometric proximity, culling for physics and lighting might be different again, and the list goes on.

The naive answer is "just duplicate the transforms when they are updated", but that introduces its own complexity and comes at its own cost of fetching data from the cache.

I guess what I'm getting at is:

1) I would love to learn more about how this problem is tackled by robust game engines (I guess games too - but games have more specific knowledge than engines and can have unique code to handle it).

2) Does it all just come out in the wash at the end of the day? Not saying just throw it all on the heap and don't care... but maybe optimizing for one path, like "updating world transforms for rendering", is worth it, and whatever the cost is to jump around elsewhere doesn't really matter?

Sorry if my question is a bit vague... any insight is appreciated.
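One compromise that comes up a lot (just a sketch, not how any particular engine does it; all the type names here are made up): keep a single contiguous pool of transforms, and give each system its own sorted list of indices into it, so the per-system iteration order is whatever that system wants while the transform data itself is never duplicated.

```c
#include <stddef.h>

/* One contiguous pool of transforms (layout illustrative). */
typedef struct { float pos[3]; float rot[4]; float scale[3]; } Transform;

typedef struct {
    Transform *transforms;   /* single source of truth, contiguous */
    size_t     count;
} TransformPool;

/* Each system keeps its own ordering as indices into the shared pool:
 * rendering sorts by material, collision by spatial cell, and so on.
 * Only the small index arrays are duplicated, not the transform data. */
typedef struct {
    const size_t *order;     /* indices in this system's preferred order */
    size_t        count;
} SystemView;

void update_system(const TransformPool *pool, const SystemView *view,
                   void (*visit)(const Transform *))
{
    for (size_t i = 0; i < view->count; i++)
        visit(&pool->transforms[view->order[i]]);  /* gather through the index */
}
```

The gather through `order` still isn't free, which is why some engines go further and copy out just the fields a given system actually needs - which is basically the "duplicate on update" trade-off again, just per field instead of per Transform.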
This might be a naive question: how did we decide as an industry that cache should be controlled by the hardware, but registers and main memory by the compiler?
Anyone else start thinking of Rust's mutable/immutable borrow system when reading about the MESI protocol? It's not quite the same - with Rust, mutable borrows are never shared, and you can never mutate through a shared read-only borrow - but the principle seems like a simplification of the full MESI protocol.

It seems like this would be generally applicable to a wide variety of distributed and concurrent applications.
AMD's Rome processors are an interesting case, with 16MB of L3 cache per four-core CCX. So the 64 core processor has 256MB of L3 cache. It wasn't that long ago that 256MB was a respectable amount of DRAM in a big server. An early-90s, fridge-sized Sun 690MP maxed out at 1GB of DRAM and had 1MB of L2 cache, no L3.
> “different cores can have different/stale values in their individual caches”.

Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.

And what about virtual caches? Non-coherent cache lines?

Moreover, even in the face of cache coherency you can still have race conditions.
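A minimal illustration of that last point: even on fully coherent hardware, a plain read-modify-write races, because coherence only keeps individual cache lines consistent; it does not make `counter++` atomic.

```c
#include <pthread.h>
#include <stdio.h>

/* Classic lost-update race: the caches stay perfectly coherent, but
 * counter++ is a load, an add, and a store, and two cores can interleave
 * them so that increments are lost. Coherence is not atomicity. */
static long counter = 0;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;              /* data race: final value is typically < 2000000 */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);
    return 0;
}
```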
The one-line summary seems to be that one should never worry about the caches themselves introducing concurrency bugs.

I mean, once we account for memory-operation reordering on each core, modeling each memory address as storing a single value that is visible to all cores is correct from a concurrency-correctness point of view, right?
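The standard "store buffering" litmus test is what makes the "account for reordering" part concrete (sketch in C11 atomics): each location individually behaves like a single coherent value, yet with relaxed ordering - or, on real x86 hardware, with plain stores and loads, thanks to store buffers - both threads can read 0, an outcome sequential consistency forbids.

```c
#include <stdatomic.h>

/* Store-buffering litmus test. Under sequential consistency at least one
 * thread must observe the other's store, so r0 == 0 && r1 == 0 is forbidden.
 * With relaxed ordering, both loads may return 0. */
static atomic_int x = 0, y = 0;
static int r0, r1;

void thread0(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
}

void thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
}
```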
It's worth pointing out that the L1 cache and its associated logic is the only* way a core talks to the outside world, including all I/O. With that in mind, it is easy to understand why it is so crucial to performance.

* there might be some minor exceptions
The biggest myth is that any of this matters to anybody but a tiny fraction of niche programmers.

The reason some "app" is slow is not cache coherence traffic. It's that somebody chose the wrong data structure, created some stupid architecture, or wrote incomprehensible code that the next guy extended in a horribly inefficient way because they didn't understand the original. My web browsing experience is slow because people include heaps of bloated JS crap and trackers and ad networks, so I have to load 15 megabytes of nonsense and wait for JS to render stuff I don't want to see. None of this would be any better if anybody involved understood CPU caches better.

Even in the kernel or in HPC applications, most code is not on the hot path. Programmers should rather focus on clean architecture and understandable code. How does it help if you hand-optimize something that nobody understands, nobody can fix a bug in, and nobody can extend? That's horrible code, even if it's 5% faster than what somebody else wrote.

TL;DR: This is interesting, but likely totally irrelevant to your day job. In the list of things to do to improve your code, it comes so far down that you'll never get there.
Note that this is quite specific to x86; on other architectures like POWER there are much weaker guarantees, which will lead to problems if you assume the same model.