
Measuring CPU core-to-core latency

319 points | by nviennot | over 2 years ago

24 comments

apaolillo | over 2 years ago

We published a paper where we captured the same kind of insights (deep NUMA hierarchies including cache levels, NUMA nodes, packages) and used them to tailor spinlocks to the underlying machine: https://dl.acm.org/doi/10.1145/3477132.3483557
wyldfire | over 2 years ago

This is a cool project.

It looks kinda like the color scales are normalized to just-this-CPU's latency? It would be neater if the scale represented the same values among CPUs. Or rather, it would be neat if there were an additional view for this data that could make it easier to compare among them.

I think the differences are really interesting to consider. What if the scheduler could consider these designs when weighing how to schedule each task? Either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That Nest [1] scheduler was recently shared on HN, I suppose it may be taking advantage of some of these properties.

[1] https://dl.acm.org/doi/abs/10.1145/3492321.3519585
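The cache topology mentioned above is indeed exposed on Linux under /sys/devices/system/cpu/cpuN/cache/. Below is a minimal sketch that prints which CPUs share each cache level of cpu0; it is an illustration assuming the usual sysfs layout, not code from the project.

    // Print cpu0's cache levels and which CPUs share each of them,
    // using the standard Linux sysfs layout.
    use std::fs;

    fn main() -> std::io::Result<()> {
        for entry in fs::read_dir("/sys/devices/system/cpu/cpu0/cache")? {
            let dir = entry?.path();
            let name = dir
                .file_name()
                .map(|n| n.to_string_lossy().into_owned())
                .unwrap_or_default();
            if !name.starts_with("index") {
                continue; // skip entries like "uevent"
            }
            let read = |f: &str| {
                fs::read_to_string(dir.join(f))
                    .unwrap_or_default()
                    .trim()
                    .to_string()
            };
            // e.g. "L2 Unified shared by CPUs 0-1" or "L3 Unified shared by CPUs 0-15"
            println!(
                "L{} {} shared by CPUs {}",
                read("level"),
                read("type"),
                read("shared_cpu_list")
            );
        }
        Ok(())
    }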
rigtorp | over 2 years ago

I have something similar but in C++: https://github.com/rigtorp/c2clat
hot_gril | over 2 years ago

I was wondering what real-life situations this benchmark matters the most in, then I remembered... A few years ago I was working on a uni research project trying to eke out the most performance possible in an x86 software-defined EPC, basically the gateway that sits between the LTE cell tower intranet and the rest of the Internet. The important part for me to optimize was the control plane, which handles handshakes between end users and the gateway (imagine everyone spam-toggling airplane mode when their LTE drops). Cache coherence latency was a bottleneck. The control plane I developed had diminishing returns in throughput up to like 8 cores on a 12-core CPU in our dual-socket test machine. Beyond that, adding cores actually slowed it down significantly*. Not a single-threaded task but not embarrassingly parallel either. The data plane was more parallel, and it ran on a separate NUMA node. Splitting either across NUMA nodes destroyed the performance.

* which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests
jtorsella | over 2 years ago

If anyone is interested, here are the results on my M1 Pro running Asahi Linux:

Min: 48.3  Max: 175.0  Mean: 133.0

I'll try to copy the exact results once I have a browser on Asahi, but the general pattern is most pairs have >150ns and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are faster at about 50ns.

Edit: The results from c2clat (a little slower, but the format is nicer) are below.

     CPU    0    1    2    3    4    5    6    7    8    9
       0    0   59  231  205  206  206  208  219  210  210
       1   59    0  205  215  207  207  209  209  210  210
       2  231  205    0   40   42   43  180  222  224  213
       3  205  215   40    0   43   43  212  222  213  213
       4  206  207   42   43    0   44  182  227  217  217
       5  206  207   43   43   44    0  215  215  217  217
       6  208  209  180  212  182  215    0   40   43   45
       7  219  209  222  222  227  215   40    0   43   43
       8  210  210  224  213  217  217   43   43    0   44
       9  210  210  213  213  217  217   45   43   44    0
snvzz | over 2 years ago

> This software is licensed under the MIT license

Maybe consider including an MIT license file in the repository.

Legally, that's a bit more sane than having a line in the readme.

In practice, GitHub will recognize your license file and show the license in the indexes and in the right column of your repository's main page.
dan-robertson | over 2 years ago

It would be interesting to have a more detailed understanding of why these are the latencies; e.g. this repo has 'clusters', but there is surely some architectural reason for these clusters. Is it just physical distance on the chip or is there some other design constraint?

I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.
ozcanay | over 2 years ago

I am currently working on my master's degree in computer science, studying this exact topic.

In order to measure core-to-core latency, we should also learn how cache coherence works on Intel. I am currently experimenting with microbenchmarks on the Skylake microarchitecture. Due to the scalability issues with the ring interconnect on CPU dies in previous models, Intel opted for a 2D mesh interconnect in recent years. In this microarchitecture, the CPU die is split into tiles, each accommodating cores, caches, a CHA, a snoop filter, etc. I want to emphasize the role of the CHA here. Each CHA is responsible for managing coherence for a portion of the addresses. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing coherence for that variable's address will be queried to learn the whereabouts of the variable. If the data is on the die, the core currently owning the variable will be told to forward it to the requesting core. So, even though the cores that communicate with each other are physically contiguous, the location of the CHA that manages coherence for the variable they pass back and forth also matters, due to the cache coherence mechanism.

Related links:

https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf

https://par.nsf.gov/servlets/purl/10278043
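A minimal sketch of the cache-line ping-pong that core-to-core measurements like this boil down to (a hypothetical illustration, not the repo's actual code; for meaningful numbers the two threads would also need to be pinned to specific cores, e.g. by running under taskset):

    // Two threads bounce a counter through one shared atomic. Each round trip
    // forces ownership of the cache line to migrate between the two cores, so
    // the measured time is dominated by the coherence traffic described above.
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Instant;

    fn main() {
        const ROUND_TRIPS: u64 = 1_000_000;
        let flag = Arc::new(AtomicU64::new(0));

        let responder = {
            let flag = Arc::clone(&flag);
            thread::spawn(move || {
                for i in 0..ROUND_TRIPS {
                    // Wait for the initiator's value, then answer with the next one.
                    while flag.load(Ordering::Acquire) != 2 * i + 1 {}
                    flag.store(2 * i + 2, Ordering::Release);
                }
            })
        };

        let start = Instant::now();
        for i in 0..ROUND_TRIPS {
            flag.store(2 * i + 1, Ordering::Release);
            while flag.load(Ordering::Acquire) != 2 * i + 2 {}
        }
        let elapsed = start.elapsed();
        responder.join().unwrap();

        // One-way latency is roughly half the round-trip time.
        println!(
            "~{:.1} ns per one-way hop",
            elapsed.as_nanos() as f64 / (2.0 * ROUND_TRIPS as f64)
        );
    }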
moep0 | over 2 years ago
Why does CPU=8 in Intel Core i9-12900K have fast access to all other cores? It is interesting.
bhedgeoser | over 2 years ago

On a 5950X, the latencies for core 0 are very high if SMT is enabled, I wonder why that is?

              0      1
      0
      1    26±0
      2    26±0   17±0
      3    27±0   17±0
      4    32±0   17±0
      5    29±0   19±0
      6    32±0   18±0
      7    31±0   17±0
      8   138±1   81±0
      9   138±1   83±0
     10   139±1   80±0
     11   136±1   84±0
     12   134±1   83±0
     13   137±1   80±0
     14   136±1   84±0
     15   139±1   84±0
     16    16±0   16±0
     17    28±0    8±0
     18    33±0   17±0
     19    29±0   16±0
     20    28±0   17±0
     21    29±0   19±0
     22    32±0   18±0
     23    31±0   17±0
     24   137±1   81±0
     25   140±1   79±0
     26   143±1   80±0
     27   138±1   82±0
     28   139±1   82±0
     29   139±1   81±0
     30   142±1   82±0
     31   142±1   84±0
bullen | over 2 years ago

I ran ./c2clat on a Raspberry 4:

     CPU   0   1   2   3
       0   0  77  77  77
       1  77   0  77  77
       2  77  77   0  77
       3  77  77  77   0

And Raspberry 2:

     CPU   0   1   2   3
       0   0  71  71  71
       1  71   0  71  71
       2  71  71   0  71
       3  71  71  71   0
jeffbee | over 2 years ago

Fails to build from source with Rust 1.59, so I tried the C++ c2clat from elsewhere in the thread. Quite interesting on Alder Lake, because the quartet of Atom cores has uniform latency (they share an L2 cache and other resources) while the core-to-core latency of the Core side of the CPU varies. Note that the way these are logically numbered is: 0,1 are SMT threads of the first core and so forth through 14-15; 16-19 are Atom cores with 1 thread each.

     CPU   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
       0   0  12  60  44  60  44  60  43  50  47  56  48  58  49  60  50  79  79  78  79
       1  12   0  45  45  44  44  60  43  51  49  55  47  57  49  56  51  76  76  76  76
       2  60  45   0  13  42  43  53  43  48  37  52  41  53  42  53  42  72  72  72  72
       3  44  45  13   0  42  43  53  42  47  37  51  40  53  41  53  42  72  72  72  72
       4  60  44  42  42   0  13  56  43  49  52  54  41  56  42  42  41  75  75  74  75
       5  44  44  43  43  13   0  56  43  51  54  55  41  56  42  56  42  77  77  77  77
       6  60  60  53  53  56  56   0  13  49  54  56  41  57  42  57  42  78  78  78  78
       7  43  43  43  42  43  43  13   0  46  47  54  41  41  41  55  41  72  71  71  71
       8  50  51  48  47  49  51  49  46   0  12  51  51  54  56  55  56  75  75  75  75
       9  47  49  37  37  52  54  54  47  12   0  49  53  54  56  55  54  74  69  67  68
      10  56  55  52  51  54  55  56  54  51  49   0  13  53  58  56  59  75  75  76  75
      11  48  47  41  40  41  41  41  41  51  53  13   0  51  52  55  59  75  75  75  75
      12  58  57  53  53  56  56  57  41  54  54  53  51   0  13  55  60  77  77  77  77
      13  49  49  42  41  42  42  42  41  56  56  58  52  13   0  55  54  77  77  77  77
      14  60  56  53  53  42  56  57  55  55  55  56  55  55  55   0  12  74  70  78  78
      15  50  51  42  42  41  42  42  41  56  54  59  59  60  54  12   0  75  74  74  77
      16  79  76  72  72  75  77  78  72  75  74  75  75  77  77  74  75   0  55  55  55
      17  79  76  72  72  75  77  78  71  75  69  75  75  77  77  70  74  55   0  55  55
      18  78  76  72  72  74  77  78  71  75  67  76  75  77  77  78  74  55  55   0  55
      19  79  76  72  72  75  77  78  71  75  68  75  75  77  77  78  77  55  55  55   0
vladvasiliu | over 2 years ago

This is interesting. I'm getting much worse results on an i7-1165G7 than the ones published:

    Num cores: 8
    Using RDTSC to measure time: true
    Num round trips per samples: 5000
    Num samples: 300
    Showing latency=round-trip-time/2 in nanoseconds:

               0       1       2       3       4       5       6       7
      0
      1    70±1
      2    53±1    42±0
      3    73±5   134±5    80±1
      4    16±0    49±1    56±1    46±1
      5    63±4    28±1   128±5    67±1    66±1
      6    56±1    49±1    10±0    81±4   124±4    72±1
      7    57±1    57±1    45±1    10±0    63±4   130±5    87±1

    Min latency: 10.1ns ±0.2 cores: (6,2)
    Max latency: 134.1ns ±5.3 cores: (3,1)
    Mean latency: 64.7ns
mey | over 2 years ago

Here is an AMD Ryzen 9 5900X on Windows 11: https://gist.github.com/smarkwell/d72deee656341d53dff469df2bcc6547
sgtnoodle | over 2 years ago

I've been doing some latency measurements like this, but between two processes using Unix domain sockets. I'm measuring more on the order of 50 µs on average when using FIFO RT scheduling. I suspect the kernel is either letting processes linger for a little bit, or perhaps the "idle" threads tend to call into the kernel and let it do some non-preemptable bookkeeping.

If I crank up the amount of traffic going through the sockets, the average latency drops, presumably due to the processes being able to batch together multiple packets rather than having to block on each one.
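A rough sketch of that kind of round-trip measurement, reduced to a single socketpair between two threads (an assumed setup rather than the commenter's actual harness; the numbers depend heavily on scheduling policy, pinning, and whether the peer is already blocked in a read):

    // Bounce one byte back and forth over a Unix-domain socketpair and report
    // the average round-trip time.
    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;
    use std::thread;
    use std::time::Instant;

    fn main() -> std::io::Result<()> {
        const ROUND_TRIPS: u32 = 100_000;
        let (mut a, mut b) = UnixStream::pair()?;

        let echo = thread::spawn(move || -> std::io::Result<()> {
            let mut buf = [0u8; 1];
            for _ in 0..ROUND_TRIPS {
                b.read_exact(&mut buf)?; // block until a byte arrives...
                b.write_all(&buf)?;      // ...and bounce it straight back
            }
            Ok(())
        });

        let mut buf = [0u8; 1];
        let start = Instant::now();
        for _ in 0..ROUND_TRIPS {
            a.write_all(&[0])?;
            a.read_exact(&mut buf)?;
        }
        let elapsed = start.elapsed();
        echo.join().unwrap()?;

        println!(
            "~{:.2} us per round trip",
            elapsed.as_micros() as f64 / ROUND_TRIPS as f64
        );
        Ok(())
    }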
mayapugai | over 2 years ago

This is a fascinating insight into a subsystem which we take for granted and naively assume is homogeneous. Thank you so much for sharing.

A request to the community: I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So it would be very interesting to see what happens with it, both on macOS and (say, Asahi) Linux.
scrubs | over 2 years ago

This benchmark reminds me of "ffwd: delegation is (much) faster than you think": https://www.seltzer.com/margo/teaching/CS508-generic/papers-a1/roghanchi17.pdf

This paper describes a mechanism for client threads pinned to distinct cores to delegate a function call to a distinguished server thread pinned to its own core, all on the same socket.

This has a multitude of applications, the most obvious one being making a shared data structure MT-safe through delegation rather than saddling it with mutexes or other synchronization points, especially beneficial with small critical sections.

The paper's abstract concludes claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."

The code does delegation without CAS, locks, or atomics.

The efficacy of such a scheme rests on two facets, which the paper explains:

* Modern CPUs can move GBs/second between core L2/LLC caches

* The synchronization between requesting clients and responding servers depends on each side spinning on a shared memory address looking for bit toggles. Briefly, servers only read client request memory, which the client only writes. (Clients each have their own slot.) And on the response side, clients read the server's shared response memory, which only the server writes. This one-side-read, one-side-write is supposed to minimize the number of cache invalidations and MESI syncs.

I spent some time testing the author's code and went so far as writing my own version. I was never able to make it work with anywhere near the throughput claimed in the paper. There are also some funny "nop" assembler instructions within the code that I gather are a cheap form of thread yielding.

In fact this relatively simple SPSC MT ring buffer, which has but a fraction of the code:

https://rigtorp.se/ringbuffer/

did far, far better.

In my experiments the CPU spun too quickly, so that core-to-core bandwidth was quickly squandered before the server could signal a response or the client could signal a request. I wonder if adding select atomic reads as with the SPSC ring might help.
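A minimal sketch of the one-writer-per-line handoff described above (an illustration of the idea, not the ffwd paper's code): the client only writes the request word, the server only writes the response word, and each side spins reading the word the other side writes.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;

    // Keep request and response on separate cache lines so each line has
    // exactly one writer.
    #[repr(align(64))]
    struct Slot(AtomicU64);

    fn main() {
        let req = Arc::new(Slot(AtomicU64::new(0)));
        let resp = Arc::new(Slot(AtomicU64::new(0)));

        let server = {
            let (req, resp) = (Arc::clone(&req), Arc::clone(&resp));
            thread::spawn(move || {
                let mut last = 0;
                loop {
                    let r = req.0.load(Ordering::Acquire);
                    if r == last {
                        continue; // spin until the client publishes a new request
                    }
                    if r == u64::MAX {
                        break; // shutdown sentinel
                    }
                    // The "delegated" function: here it just doubles the argument.
                    resp.0.store(r * 2, Ordering::Release);
                    last = r;
                }
            })
        };

        for i in 1..=5u64 {
            req.0.store(i, Ordering::Release);
            // Spin on the server-written word until the answer shows up.
            while resp.0.load(Ordering::Acquire) != i * 2 {}
            println!("f({i}) = {}", i * 2);
        }
        req.0.store(u64::MAX, Ordering::Release);
        server.join().unwrap();
    }

As the comment notes, in the real design each client gets its own request slot and the server scans all of them; the per-slot cache-line ownership pattern is the same as in this two-slot sketch.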
jesse__ | over 2 years ago

This is absolutely the coolest thing I've seen in a while.
bullen | over 2 years ago

When would cores talk to each other in the way this is measuring?

Would two cores reading and writing to the same memory have this contention?
fideloper | over 2 years ago

Because I'm ignorant: what are the practical takeaways from this?

When is a CPU core sending a message to another core?
zeristor | over 2 years ago

I realise these were run on AWS instances, but could this be run locally on Apple Silicon?

Erm, I guess I should try.
bee_rider | over 2 years ago
Does anyone know what is up with the 8275CL? It looks... almost periodic or something.
stevefan1999 | over 2 years ago
What about OS scheduling overhead?
throaway53dh | over 2 years ago

What's the difference between sockets and cores? Does a socket have separate Ln caches while cores share the cache?