This article gives the impression that everything is the compiler's fault when you end up with conflicting reads on different cores, but that's not right.<p>From the point of view of someone outside the CPU, yes, you can say that simultaneous reads of the same location will never give different answers. But everything happening at that level barely resembles the original software. Dozens of instructions are in flight at any moment, overlapping each other, starting and finishing in very different orders from how they're stored in memory. This happens even if you write the machine code byte by byte yourself.<p>From the point of view of the software, running a bunch of instructions in sequence, you do get stale values. You can have two threads wait for a signal, then both read a value, and both get different results. You can have a thread set a flag after it's done editing some values, have another thread wait for the flag, and still see the edits as incomplete after the wait.<p>The exact types of nonsense depend on the memory model, but things as simple as two MOV instructions in a row can violate strict ordering. It's not just that the compiler might stash something in a register; the CPU itself will make swaths of very observable changes within the rules set by the memory model.<p>You can't trust the hardware coherency protocol to do much for you until you follow the platform-specific rules to <i>tell the CPU to make something act coherent</i>.
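To make the flag example concrete, here's a minimal C11 sketch (illustrative only, not from the article; the names are made up). With a plain int flag and payload, both the compiler and a weakly ordered CPU are allowed to let the consumer see the flag before the edits; release/acquire ordering on the flag is what rules that out.

<pre><code>#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int payload = 0;          /* the "edits"  */
atomic_int ready = 0;     /* the flag     */

void *producer(void *arg) {
    payload = 42;                                             /* edit the values    */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* then set the flag  */
    return NULL;
}

void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                     /* wait for the flag  */
    /* With the acquire/release pairing, the edit is guaranteed visible here.
       With plain variables for both, printing 0 would be a legal outcome.    */
    printf("%d\n", payload);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
</code></pre>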
My favourite myth is that I still believe it's possible to write code which operates totally out of L1 and L2 cache. Not just tight ASM on bare metal: C code or similar, compiled down, running under a modern UNIX/POSIX OS on a multi-core host.<p>I have never explored HOW this would work, or WHAT I would do to achieve it, but I believe it, implicitly. For it to be true I would have to understand the implications of every function and procedure call, stack operations, the actual size of objects/structs under alignment, and optimisations in the assembler to align or select code which runs faster in the given ALU, none of which I know (if I ever did, certainly not any more).<p>But still: I believe this myth. I believe I know somebody who has achieved it, to test some ideas about a CPU's ability to saturate a NIC at wire rate with pre-formed packets. He was able to show that by binding affinity to specific cores and running this code, he could basically flood any link his CPU was capable of being exposed to, for a given PCI generation of NIC speeds available to him (as I understand it). But that assumes walking off L2 cache would have made it run slow enough that it couldn't do this.<p>So I think it remains a myth, to me.
Hmm, I think the discussion is missing a few things needed for a complete picture of the situation.<p>First, ARM and x86 coherency models differ, so a big disclaimer is needed regarding the protocol. Most ARM processors use the MOESI protocol instead of the MESI protocol.<p>Second, synchronization isn't just because of register volatility and such. Synchronization is needed in general because without the appropriate lock/barrier instructions, compilers make assumptions about how loads and stores may be reordered with respect to one another.
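To illustrate the second point, a small C sketch (my own example, not from the article): with a plain variable the compiler is allowed to assume nothing else changes it and hoist the load out of the loop entirely; making the flag atomic (or using the right fences/locks) takes that freedom away.

<pre><code>#include <stdatomic.h>

int stop_plain = 0;          /* plain int: the compiler may keep it in a register        */
atomic_int stop_atomic = 0;  /* atomic: every load really happens, and ordering is kept  */

void spin_plain(void) {
    /* Nothing in this function writes stop_plain, so the compiler may
       legally compile this as `if (!stop_plain) for (;;);` -- an infinite
       loop even after another thread sets the flag.                       */
    while (!stop_plain)
        ;
}

void spin_atomic(void) {
    /* Each iteration performs a real load; acquire ordering also keeps
       later reads from being moved ahead of seeing the flag.             */
    while (!atomic_load_explicit(&stop_atomic, memory_order_acquire))
        ;
}
</code></pre>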
We worked with the Intel guys, in my last gig. They were incredibly helpful.<p>They were an impressive lot, and they helped us out, quite a bit. They often sent engineers over, for weeks at a time, to help us optimize.<p>The cache thing was a 100X improvement thing, and it came from the oddest places. There's a lot of "that doesn't make sense!" stuff, with preserving caches.<p>I don't remember all the tricks, but we were constantly surprised.<p>One thing that saved us, was instrumentation. Intel had a bunch of utilities that they wrote, and that kept showing us that the clever thing we did, was not so clever.
Dealing with caches, memory ordering, and memory barriers can be truly mind-warping stuff, even for those who have spent years dealing with basic cache coherency before. If you want a challenge, try to absorb all this in one sitting.<p><a href="https://www.kernel.org/doc/Documentation/memory-barriers.txt" rel="nofollow noreferrer">https://www.kernel.org/doc/Documentation/memory-barriers.txt</a><p>I kept an earlier version of this close to hand at all times a couple of jobs ago where we were using our own chips with a very weak memory ordering. The implementation team had mostly come from Alpha, which had the weakest memory ordering ever, and in the intervening years a lot of bugs related to missing memory barriers had crept into the kernel because nobody was using anything nearly as weak. I specifically remember at least one in NBD, at least one in NFS, and many in Lustre. Pain in the ass to debug, because by the time you can look at anything the values have "settled" and seem correct.<p>For extra fun, as weak as the memory ordering was, the first run of chips didn't even get that right. LL/SC wouldn't work reliably unless the LL was issued twice, so we actually modified compilers to do that. Ew.
The central myth is that the average programmer should care.<p>The typical programmer should treat CPU caches as what they are designed to be: mostly transparent. You work in a high-level language and leave the tricky details to a library and your compiler.<p>It's only a small minority that should really worry about these things.<p>In my daily work, I more often see premature micro-optimizations (in part based on the myths from the article) that are entirely unnecessary than code that genuinely needs to be optimized for these things.
Loved the article, appreciate the share. A meta topic but I wish people would spend more time discussing what they like about something than the opposite.<p>This author took great pains to give many caveats around their writing and distilled down a very complex topic into something that could be consumed in less than 10 minutes.<p>I'm smarter for having read this.<p>Best of luck to any poor soul who attempts to summarize the work of many architects and then share it with this group of people.
SQLite database locks can be more quickly obtained and released if CPU affinity is set for the database processes, allowing all I/O activity to share the same cache(es).<p>I have read (but cannot remember where) that this can increase performance by thousands of DML operations per second.
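If you want to try it, pinning is just a couple of calls. A Linux-specific sketch (sched_setaffinity isn't portable; the core number and filename here are arbitrary) that pins the process before opening the database:

<pre><code>#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    /* Pin this process to core 2 so the page cache and lock state it
       touches tend to stay warm in one core's caches.                 */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    sqlite3 *db;
    if (sqlite3_open("test.db", &db) != SQLITE_OK)
        return 1;
    /* ... run the DML workload as usual ... */
    sqlite3_close(db);
    return 0;
}
</code></pre>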
On the related topic of store buffers and write coalescing, can anyone point me to some good articles discussing how this works in practice for merging small writes to the same cache line? I have a workload where we are appending different-size payloads (of sizes between 1 and 64 bytes) to an in-memory log. One option is to just store each entry in its own cache line. A more space-efficient option is to pack each write tightly one after the other. My hypothesis is that due to write coalescing in store buffers this will also be more memory-bandwidth efficient, since writes to the same cache line will be merged. Is this correct? Note that metadata for each entry (e.g. the size) will be stored separately.
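For reference, the two layouts being compared might look roughly like this (an illustrative sketch with made-up names): the padded variant burns a full 64-byte line per entry, while the packed variant just advances a byte offset and leaves it to the store buffer to merge neighbouring writes to the same line.

<pre><code>#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* Option 1: one entry per cache line. Simple, no sharing between entries,
   but every small append still dirties a whole 64-byte line.             */
struct padded_entry {
    uint8_t payload[CACHE_LINE];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_entry padded_log[1u << 20];
static size_t padded_idx;

void append_padded(const void *p, size_t len) {   /* len is 1..64 */
    memcpy(padded_log[padded_idx++].payload, p, len);
}

/* Option 2: pack entries back to back. Consecutive small appends land in
   the same line, so the store buffer can coalesce them before write-back. */
static uint8_t packed_log[64u << 20];
static size_t packed_off;

void append_packed(const void *p, size_t len) {
    memcpy(packed_log + packed_off, p, len);
    packed_off += len;
}
</code></pre>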
In Java, threads that access unsynchronized variables can read different values. It's in the JVM spec, and it was a surprise back then. In C you could assume (at least in some runtimes) that you only needed to sync on write.
Language-specific synchronization primitives such as mutexes or volatile variables use memory barriers to ensure the caches are flushed and/or a CPU core reads from main memory directly:
<a href="https://en.wikipedia.org/wiki/Memory_barrier" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Memory_barrier</a>
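As a concrete example of that first point, here's a small pthread sketch (illustrative only): the lock/unlock pair already contains the required barriers (acquire on lock, release on unlock), so both threads agree on the plain counter without the programmer issuing any explicit fences.

<pre><code>#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;             /* plain variable, protected by the mutex */

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* acts as an acquire barrier */
        counter++;
        pthread_mutex_unlock(&lock);  /* acts as a release barrier  */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);         /* always 2000000 */
    return 0;
}
</code></pre>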