This article gives the impression that everything is the compiler's fault when you end up with conflicting reads on different cores, but that's not right.<p>From the point of view of someone outside the CPU, yes, you can say that simultaneous reads of the same location will never give different answers. But everything happening at that level barely resembles the original software. Dozens of instructions are in flight at any moment, overlapping each other, starting and finishing in very different orders from how they're stored in memory. This happens even if you write the machine code byte by byte yourself.<p>From the point of view of the software, running a bunch of instructions in sequence, you do get stale values. You can have two threads wait for a signal, then both read a value, and both get different results. You can have a thread set a flag after it's done editing some values, have another thread wait for the flag, and still see the edits as incomplete after the wait.<p>The exact types of nonsense depend on the memory model, but things as simple as two MOV instructions in a row can violate strict ordering. It's not just that the compiler might stash something in a register; the CPU itself will make swaths of very observable changes within the rules set by the memory model.<p>You can't trust the hardware coherency protocol to do much for you until you follow the platform-specific rules to <i>tell the CPU to make something act coherent</i>.
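To make the flag example concrete, here's a minimal C11 sketch (illustrative only, not from the article; the names are made up). With a plain int flag and payload, both the compiler and a weakly ordered CPU are allowed to let the consumer see the flag before the edits; release/acquire ordering on the flag is what rules that out.

<pre><code>#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int payload = 0;          /* the "edits"  */
atomic_int ready = 0;     /* the flag     */

void *producer(void *arg) {
    payload = 42;                                             /* edit the values    */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* then set the flag  */
    return NULL;
}

void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                     /* wait for the flag  */
    /* With the acquire/release pairing, the edit is guaranteed visible here.
       With plain variables for both, printing 0 would be a legal outcome.    */
    printf("%d\n", payload);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
</code></pre>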
My favourite myth is that I still believe it's possible to write code which operates totally out of L1 and L2 cache. Not just tight ASM on bare metal: C code or similar, compiled down, running under a modern UNIX/POSIX OS on a multi-core host.<p>I have never explored HOW this would work, or WHAT I would do to achieve it, but I believe it, implicitly. For it to be true I would have to understand the implications of every function and procedure call, stack operations, the actual size of objects/structs under alignment, and optimisations in the assembler to align or select code which runs faster in the given ALU, none of which I know (if I ever did, certainly not any more).<p>But still: I believe this myth. I believe I know somebody who has achieved it, to test some ideas about a CPU's ability to saturate a NIC at wire rate with pre-formed packets. He was able to show that by binding affinity to specific cores and running this code, he could basically flood any link his CPU was capable of being exposed to, for a given PCI generation of NIC speeds available to him (as I understand it). But that assumes walking off L2 cache would have made it run slow enough that it couldn't do this.<p>So I think it remains a myth, to me.
Hmm, I think the discussion is missing a few things needed for a complete picture of the situation.<p>First, ARM and x86 coherency models differ, so a big disclaimer is needed regarding the protocol. Most ARM processors use the MOESI protocol instead of the MESI protocol.<p>Second, synchronization isn't just because of register volatility and such. Synchronization is needed in general because without the appropriate lock/barrier instructions, compilers make assumptions about how loads and stores may be reordered with respect to one another.
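To illustrate the second point, a small C sketch (my own example, not from the article): with a plain variable the compiler is allowed to assume nothing else changes it and hoist the load out of the loop entirely; making the flag atomic (or using the right fences/locks) takes that freedom away.

<pre><code>#include <stdatomic.h>

int stop_plain = 0;          /* plain int: the compiler may keep it in a register        */
atomic_int stop_atomic = 0;  /* atomic: every load really happens, and ordering is kept  */

void spin_plain(void) {
    /* Nothing in this function writes stop_plain, so the compiler may
       legally compile this as `if (!stop_plain) for (;;);` -- an infinite
       loop even after another thread sets the flag.                       */
    while (!stop_plain)
        ;
}

void spin_atomic(void) {
    /* Each iteration performs a real load; acquire ordering also keeps
       later reads from being moved ahead of seeing the flag.             */
    while (!atomic_load_explicit(&stop_atomic, memory_order_acquire))
        ;
}
</code></pre>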
We worked with the Intel guys, in my last gig. They were incredibly helpful.<p>They were an impressive lot, and they helped us out, quite a bit. They often sent engineers over, for weeks at a time, to help us optimize.<p>The cache thing was a 100X improvement thing, and it came from the oddest places. There's a lot of "that doesn't make sense!" stuff, with preserving caches.<p>I don't remember all the tricks, but we were constantly surprised.<p>One thing that saved us, was instrumentation. Intel had a bunch of utilities that they wrote, and that kept showing us that the clever thing we did, was not so clever.
Dealing with caches, memory ordering, and memory barriers can be truly mind-warping stuff, even for those who have spent years dealing with basic cache coherency before. If you want a challenge, try to absorb all this in one sitting.<p><a href="https://www.kernel.org/doc/Documentation/memory-barriers.txt" rel="nofollow noreferrer">https://www.kernel.org/doc/Documentation/memory-barriers.txt</a><p>I kept an earlier version of this close to hand at all times a couple of jobs ago where we were using our own chips with a very weak memory ordering. The implementation team had mostly come from Alpha, which had the weakest memory ordering ever, and in the intervening years a lot of bugs related to missing memory barriers had crept into the kernel because nobody was using anything nearly as weak. I specifically remember at least one in NBD, at least one in NFS, and many in Lustre. Pain in the ass to debug, because by the time you can look at anything the values have "settled" and seem correct.<p>For extra fun, as weak as the memory ordering was, the first run of chips didn't even get that right. LL/SC wouldn't work reliably unless the LL was issued twice, so we actually modified compilers to do that. Ew.
The central myth is that the average programmer should care.<p>The typical programmer should treat CPU caches as what they are designed to be: mostly transparent. You work in a high-level language and leave the tricky details to a library and your compiler.<p>It's only a small minority that should really worry about these things.<p>In my daily work, I more often see premature micro-optimizations (in part based on the myths from the article) that are entirely unnecessary than code that genuinely needs to be optimized for these things.
Loved the article, appreciate the share. A meta topic but I wish people would spend more time discussing what they like about something than the opposite.<p>This author took great pains to give many caveats around their writing and distilled down a very complex topic into something that could be consumed in less than 10 minutes.<p>I'm smarter for having read this.<p>Best of luck to any poor soul who attempts to summarize the work of many architects and then share it with this group of people.
SQLite database locks can be more quickly obtained and released if CPU affinity is set for the database processes, allowing all I/O activity to share the same cache(es).<p>I have read (but cannot remember where) that this can increase performance by thousands of DML operations per second.
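If you want to try it, pinning is just a couple of calls. A Linux-specific sketch (sched_setaffinity isn't portable; the core number and filename here are arbitrary) that pins the process before opening the database:

<pre><code>#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    /* Pin this process to core 2 so the page cache and lock state it
       touches tend to stay warm in one core's caches.                 */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    sqlite3 *db;
    if (sqlite3_open("test.db", &db) != SQLITE_OK)
        return 1;
    /* ... run the DML workload as usual ... */
    sqlite3_close(db);
    return 0;
}
</code></pre>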
On the related topic of store buffers and write coalescing, can anyone point me to some good articles discussing how this works in practice for merging small writes to the same cache line? I have a workload where we are appending different-size payloads (of sizes between 1 and 64 bytes) to an in-memory log. One option is to just store each entry in its own cache line. A more space-efficient option is to pack each write tightly one after the other. My hypothesis is that due to write coalescing in store buffers this will also be more memory-bandwidth efficient, since writes to the same cache line will be merged. Is this correct? Note that metadata for each entry (e.g. the size) will be stored separately.
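For reference, the two layouts being compared might look roughly like this (an illustrative sketch with made-up names): the padded variant burns a full 64-byte line per entry, while the packed variant just advances a byte offset and leaves it to the store buffer to merge neighbouring writes to the same line.

<pre><code>#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* Option 1: one entry per cache line. Simple, no sharing between entries,
   but every small append still dirties a whole 64-byte line.             */
struct padded_entry {
    uint8_t payload[CACHE_LINE];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_entry padded_log[1u << 20];
static size_t padded_idx;

void append_padded(const void *p, size_t len) {   /* len is 1..64 */
    memcpy(padded_log[padded_idx++].payload, p, len);
}

/* Option 2: pack entries back to back. Consecutive small appends land in
   the same line, so the store buffer can coalesce them before write-back. */
static uint8_t packed_log[64u << 20];
static size_t packed_off;

void append_packed(const void *p, size_t len) {
    memcpy(packed_log + packed_off, p, len);
    packed_off += len;
}
</code></pre>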
In Java, threads that access unsynchronized variables can read different values. It's in the JVM spec, and it was a surprise back then. In C you could assume (at least in some runtimes) that you only needed to sync on write.
Language-specific synchronization primitives such as mutexes or volatile variables use memory barriers to ensure the caches are flushed and/or a CPU core reads from main memory directly:
<a href="https://en.wikipedia.org/wiki/Memory_barrier" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Memory_barrier</a>
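As a concrete example of that first point, here's a small pthread sketch (illustrative only): the lock/unlock pair already contains the required barriers (acquire on lock, release on unlock), so both threads agree on the plain counter without the programmer issuing any explicit fences.

<pre><code>#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;             /* plain variable, protected by the mutex */

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* acts as an acquire barrier */
        counter++;
        pthread_mutex_unlock(&lock);  /* acts as a release barrier  */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);         /* always 2000000 */
    return 0;
}
</code></pre>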