We published a paper where we captured the same kind of insights (deep NUMA hierarchies, including cache levels, NUMA nodes, and packages) and used them to tailor spinlocks to the underlying machine: <a href="https://dl.acm.org/doi/10.1145/3477132.3483557" rel="nofollow">https://dl.acm.org/doi/10.1145/3477132.3483557</a>
This is a cool project.<p>It looks kinda like the color scales are normalized to just this CPU's latency? It would be neater if the scale represented the same values across CPUs. Or rather, it would be neat if there were an additional view of this data that made it easier to compare among them.<p>I think the differences are really interesting to consider. What if the scheduler could take these designs into account when weighing how to schedule each task? Either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That Nest [1] scheduler was recently shared on HN; I suppose it may be taking advantage of some of these properties.<p>[1] <a href="https://dl.acm.org/doi/abs/10.1145/3492321.3519585" rel="nofollow">https://dl.acm.org/doi/abs/10.1145/3492321.3519585</a>
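On the sysfs point: the cache-sharing topology a scheduler could draw on is indeed exposed under /sys/devices/system/cpu/cpuN/cache/ on Linux. A minimal sketch of reading it for cpu0 (standard sysfs paths; error handling and the other CPUs omitted):
<pre><code>/* Print which CPUs share each of cpu0's caches, per cache index. */
#include <stdio.h>

int main(void) {
    char path[128], buf[256];
    for (int idx = 0; idx < 8; idx++) {      /* index0..indexN: one entry per cache level/type */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
        FILE *f = fopen(path, "r");
        if (!f) break;                       /* no more cache indices */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu0 cache index%d shared with CPUs: %s", idx, buf);
        fclose(f);
    }
    return 0;
}
</code></pre>
Cross-referencing shared_cpu_list per index with a latency matrix like this tool's would show how much of the structure the existing topology files already capture.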
I have something similar but in C++: <a href="https://github.com/rigtorp/c2clat" rel="nofollow">https://github.com/rigtorp/c2clat</a>
I was wondering what real-life situations this benchmark matters the most in, then I remembered... A few years ago I was working on a uni research project trying to eke out the most performance possible in an x86 software-defined EPC, basically the gateway that sits between the LTE cell tower intranet and the rest of the Internet. The important part for me to optimize was the control plane, which handles handshakes between end users and the gateway (imagine everyone spam-toggling airplane mode when their LTE drops). Cache coherence latency was a bottleneck. The control plane I developed had diminishing returns in throughput up to like 8 cores on a 12-core CPU in our dual-socket test machine. Beyond that, adding cores actually slowed it down significantly*. Not a single-threaded task, but not embarrassingly parallel either. The data plane was more parallel, and it ran on a separate NUMA node. Splitting either across NUMA nodes destroyed the performance.<p>* which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests
>This software is licensed under the MIT license<p>Maybe consider including an MIT license file in the repository.<p>Legally, that's a bit more sane than having a line in the readme.<p>In practice, GitHub will recognize your license file and show the license in the indexes and in the right column of your repository's main page.
It would be interesting to have a more detailed understanding of why these are the latencies, e.g. this repo has ‘clusters’, but there is surely some architectural reason for these clusters. Is it just physical distance on the chip, or is there some other design constraint?<p>I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.
I am currently working on my master's degree in computer science and studying this exact topic.<p>To measure core-to-core latency, we should also understand how cache coherence works on Intel. I am currently experimenting with microbenchmarks on the Skylake microarchitecture. Due to the scalability issues of the ring interconnect used on CPU dies in previous models, Intel opted for a 2D mesh interconnect in recent years. In this microarchitecture, the CPU die is split into tiles, each accommodating cores, caches, a CHA (caching/home agent), a snoop filter, etc. I want to emphasize the role of the CHA here. Each CHA is responsible for managing coherence for a portion of the address space. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing coherence for that variable's address is queried to learn its whereabouts. If the data is on the die, the core currently owning the variable is told to forward it to the requesting core. So even if two communicating cores are physically adjacent, the location of the CHA that manages coherence for the variable they pass back and forth also matters, due to the cache coherence mechanism.<p>Related links:<p><a href="https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf" rel="nofollow">https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf</a><p><a href="https://par.nsf.gov/servlets/purl/10278043" rel="nofollow">https://par.nsf.gov/servlets/purl/10278043</a>
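For anyone who wants to see the effect directly, here is a minimal sketch of the usual ping-pong microbenchmark: two threads pinned to chosen cores bounce a single cache line via acquire/release stores and the round trip is averaged. The core IDs, iteration count, and use of C11 atomics are my own assumptions, not the exact method of the tool or the papers above.
<pre><code>#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000L

static _Atomic long seq;   /* the single cache line bounced between the two cores */

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *responder(void *arg) {
    pin_to_core(*(int *)arg);
    for (long i = 0; i < ITERS; i++) {
        /* wait for the ping (odd value), then answer with the next even value */
        while (atomic_load_explicit(&seq, memory_order_acquire) != 2 * i + 1) {}
        atomic_store_explicit(&seq, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int ping_core = 0, pong_core = 1;   /* assumed core IDs; pick your own pair */
    pthread_t t;
    pthread_create(&t, NULL, responder, &pong_core);
    pin_to_core(ping_core);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < ITERS; i++) {
        atomic_store_explicit(&seq, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&seq, memory_order_acquire) != 2 * i + 2) {}
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    /* each iteration is a full round trip, so halve for a one-way estimate */
    printf("core %d <-> core %d: ~%.1f ns one-way\n",
           ping_core, pong_core, ns / ITERS / 2.0);
    return 0;
}
</code></pre>
Run it over every core pair and you get the kind of matrix the tool produces; on a mesh part, which CHA owns the address of <i>seq</i> will also colour the numbers, per the above.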
Here is AMD Ryzen 9 5900x on Windows 11<p><a href="https://gist.github.com/smarkwell/d72deee656341d53dff469df2bcc6547" rel="nofollow">https://gist.github.com/smarkwell/d72deee656341d53dff469df2b...</a>
I've been doing some latency measurements like this, but between two processes using unix domain sockets. I'm measuring more on the order of 50µs on average when using FIFO RT scheduling. I suspect the kernel is either letting processes linger for a little bit, or perhaps the "idle" threads tend to call into the kernel and let it do some non-preemptible bookkeeping.<p>If I crank up the amount of traffic going through the sockets, the average latency drops, presumably because the processes can batch together multiple packets rather than having to block on each one.
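For concreteness, a rough sketch of this kind of measurement: one byte bounced over a unix domain socketpair between a parent and a forked child, optionally under SCHED_FIFO. The priority value and iteration count are arbitrary choices, and the sched_setscheduler call silently does nothing without the right privileges.
<pre><code>#include <stdio.h>
#include <sched.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERS 100000

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { perror("socketpair"); return 1; }

    /* Best effort: FIFO RT scheduling; inherited by the child across fork(). */
    struct sched_param sp = { .sched_priority = 10 };
    sched_setscheduler(0, SCHED_FIFO, &sp);

    if (fork() == 0) {                        /* child: echo server */
        char c;
        for (int i = 0; i < ITERS; i++) {
            if (read(sv[1], &c, 1) != 1) break;
            write(sv[1], &c, 1);
        }
        _exit(0);
    }

    char c = 'x';
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERS; i++) {         /* parent: ping, wait for echo */
        write(sv[0], &c, 1);
        read(sv[0], &c, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    wait(NULL);

    double us = ((end.tv_sec - start.tv_sec) * 1e9 +
                 (end.tv_nsec - start.tv_nsec)) / 1e3;
    printf("avg round trip: %.2f us\n", us / ITERS);
    return 0;
}
</code></pre>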
This is a fascinating insight into a subsystem which we take for granted and naively assume is homogeneous. Thank you so much for sharing.<p>A request to the community - I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So it would be very interesting to see what happens with it - both on macOS and (say, Asahi) Linux.
This benchmark reminds me of "ffwd: delegation is (much) faster than you think" <a href="https://www.seltzer.com/margo/teaching/CS508-generic/papers-a1/roghanchi17.pdf" rel="nofollow">https://www.seltzer.com/margo/teaching/CS508-generic/papers-...</a>.<p>This paper describes a mechanism for client threads pinned to distinct cores to delegate a function call to a distinguished server thread pinned to its own core, all on the same socket.<p>This has a multitude of applications, the most obvious being making a shared data structure MT-safe through delegation rather than saddling it with mutexes or other synchronization points, which is especially beneficial with small critical sections.<p>The paper's abstract concludes by claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."<p>The code does delegation without CAS, locks, or atomics.<p>The efficacy of such a scheme rests on two facets, which the paper explains:<p>* Modern CPUs can move GBs/second between core L2/LLC caches<p>* The synchronization between requesting clients and the responding server depends on each side spinning on a shared memory address looking for bit toggles. Briefly, the server only reads client request memory, which only the clients write (clients each have their own slot). And on the response side, clients only read the server's shared response memory, which only the server writes. This one-side-read, one-side-write arrangement (sketched below) is supposed to minimize the number of cache invalidations and MESI syncs.<p>I spent some time testing the authors' code and went so far as writing my own version. I was never able to get anywhere near the throughput claimed in the paper. There are also some funny "nop" assembler instructions within the code that I gather are a cheap form of thread yielding.<p>In fact this relatively simple SPSC MT ring buffer, which has but a fraction of the code:<p><a href="https://rigtorp.se/ringbuffer/" rel="nofollow">https://rigtorp.se/ringbuffer/</a><p>did far, far better.<p>In my experiments the CPUs spun too quickly, so core-to-core bandwidth was squandered before the server could signal a response or the client could signal a request. I wonder if adding selective atomic reads, as with the SPSC ring, might help.
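To make the signaling scheme concrete, here is a stripped-down, single-client sketch of the one-side-read / one-side-write idea: the request flag is written only by the client and read only by the server, and vice versa for the response flag, each on its own cache line. The padding, the sequence-number protocol, and the trivial delegated operation are my own simplifications; the paper's actual code packs multiple client slots and avoids even C11 atomics by relying on x86 store ordering.
<pre><code>#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CACHELINE 64
#define REQUESTS 1000000

struct slot {
    _Alignas(CACHELINE) _Atomic unsigned req_seq;  /* written only by the client */
    long arg;
    _Alignas(CACHELINE) _Atomic unsigned resp_seq; /* written only by the server */
    long result;
};

static struct slot slot;
static long server_private_sum;   /* the data structure only the server touches */

static void *server(void *unused) {
    unsigned seen = 0;
    for (int i = 0; i < REQUESTS; i++) {
        /* server only reads req_seq; it never writes it */
        while (atomic_load_explicit(&slot.req_seq, memory_order_acquire) == seen) {}
        seen++;
        server_private_sum += slot.arg;            /* the "delegated" critical section */
        slot.result = server_private_sum;
        atomic_store_explicit(&slot.resp_seq, seen, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, server, NULL);

    unsigned seq = 0;
    for (int i = 0; i < REQUESTS; i++) {
        slot.arg = i;
        atomic_store_explicit(&slot.req_seq, ++seq, memory_order_release);
        /* client only reads resp_seq; it never writes it */
        while (atomic_load_explicit(&slot.resp_seq, memory_order_acquire) != seq) {}
    }
    pthread_join(t, NULL);
    printf("final sum: %ld\n", slot.result);
    return 0;
}
</code></pre>
Even in this toy form, both sides spin flat out, which is exactly the bandwidth-squandering behaviour described above; backoff or yielding in the spin loops is where the interesting tuning lives.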