We published a paper where we captured the same kind of insights (deep NUMA hierarchies, including cache levels, NUMA nodes, and packages) and used them to tailor spinlocks to the underlying machine: <a href="https://dl.acm.org/doi/10.1145/3477132.3483557" rel="nofollow">https://dl.acm.org/doi/10.1145/3477132.3483557</a>
This is a cool project.<p>It looks kinda like the color scales are normalized to just this CPU's latency? It would be neater if the scale represented the same values across CPUs. Or rather, it would be neat if there were an additional view of this data that made it easier to compare among them.<p>I think the differences are really interesting to consider. What if the scheduler could take these designs into account when weighing how to schedule each task? Either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That Nest [1] scheduler was recently shared on HN; I suppose it may be taking advantage of some of these properties.<p>[1] <a href="https://dl.acm.org/doi/abs/10.1145/3492321.3519585" rel="nofollow">https://dl.acm.org/doi/abs/10.1145/3492321.3519585</a>
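On the sysfs point: the cache-sharing topology a scheduler could draw on is indeed exposed under /sys/devices/system/cpu/cpuN/cache/ on Linux. A minimal sketch of reading it for cpu0 (standard sysfs paths; error handling and the other CPUs omitted):
<pre><code>/* Print which CPUs share each of cpu0's caches, per cache index. */
#include <stdio.h>

int main(void) {
    char path[128], buf[256];
    for (int idx = 0; idx < 8; idx++) {      /* index0..indexN: one entry per cache level/type */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
        FILE *f = fopen(path, "r");
        if (!f) break;                       /* no more cache indices */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu0 cache index%d shared with CPUs: %s", idx, buf);
        fclose(f);
    }
    return 0;
}
</code></pre>
Cross-referencing shared_cpu_list per index with a latency matrix like this tool's would show how much of the structure the existing topology files already capture.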
I have something similar but in C++: <a href="https://github.com/rigtorp/c2clat" rel="nofollow">https://github.com/rigtorp/c2clat</a>
I was wondering what real-life situations this benchmark matters the most in, then I remembered... A few years ago I was working on a uni research project trying to eke out the most performance possible in an x86 software-defined EPC, basically the gateway that sits between the LTE cell tower intranet and the rest of the Internet. The important part for me to optimize was the control plane, which handles handshakes between end users and the gateway (imagine everyone spam-toggling airplane mode when their LTE drops). Cache coherence latency was a bottleneck. The control plane I developed had diminishing returns in throughput up to like 8 cores on a 12-core CPU in our dual-socket test machine. Beyond that, adding cores actually slowed it down significantly*. Not a single-threaded task, but not embarrassingly parallel either. The data plane was more parallel, and it ran on a separate NUMA node. Splitting either across NUMA nodes destroyed the performance.<p>* which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests
>This software is licensed under the MIT license<p>Maybe consider including an MIT license file in the repository.<p>Legally, that's a bit more sane than having a line in the readme.<p>In practice, GitHub will recognize your license file and show the license in the indexes and in the right column of your repository's main page.
It would be interesting to have a more detailed understanding of why these are the latencies, e.g. this repo has ‘clusters’, but there is surely some architectural reason for these clusters. Is it just physical distance on the chip, or is there some other design constraint?<p>I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.
I am currently working on my master's degree in computer science and studying this exact topic.<p>To measure core-to-core latency, we should also understand how cache coherence works on Intel. I am currently experimenting with microbenchmarks on the Skylake microarchitecture. Due to the scalability issues of the ring interconnect used on CPU dies in previous models, Intel opted for a 2D mesh interconnect in recent years. In this microarchitecture, the CPU die is split into tiles, each accommodating cores, caches, a CHA (caching/home agent), a snoop filter, etc. I want to emphasize the role of the CHA here. Each CHA is responsible for managing coherence for a portion of the address space. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing coherence for that variable's address is queried to learn its whereabouts. If the data is on the die, the core currently owning the variable is told to forward it to the requesting core. So even if two communicating cores are physically adjacent, the location of the CHA that manages coherence for the variable they pass back and forth also matters, due to the cache coherence mechanism.<p>Related links:<p><a href="https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf" rel="nofollow">https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf</a><p><a href="https://par.nsf.gov/servlets/purl/10278043" rel="nofollow">https://par.nsf.gov/servlets/purl/10278043</a>
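For anyone who wants to see the effect directly, here is a minimal sketch of the usual ping-pong microbenchmark: two threads pinned to chosen cores bounce a single cache line via acquire/release stores and the round trip is averaged. The core IDs, iteration count, and use of C11 atomics are my own assumptions, not the exact method of the tool or the papers above.
<pre><code>#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000L

static _Atomic long seq;   /* the single cache line bounced between the two cores */

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *responder(void *arg) {
    pin_to_core(*(int *)arg);
    for (long i = 0; i < ITERS; i++) {
        /* wait for the ping (odd value), then answer with the next even value */
        while (atomic_load_explicit(&seq, memory_order_acquire) != 2 * i + 1) {}
        atomic_store_explicit(&seq, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int ping_core = 0, pong_core = 1;   /* assumed core IDs; pick your own pair */
    pthread_t t;
    pthread_create(&t, NULL, responder, &pong_core);
    pin_to_core(ping_core);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < ITERS; i++) {
        atomic_store_explicit(&seq, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&seq, memory_order_acquire) != 2 * i + 2) {}
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    /* each iteration is a full round trip, so halve for a one-way estimate */
    printf("core %d <-> core %d: ~%.1f ns one-way\n",
           ping_core, pong_core, ns / ITERS / 2.0);
    return 0;
}
</code></pre>
Run it over every core pair and you get the kind of matrix the tool produces; on a mesh part, which CHA owns the address of <i>seq</i> will also colour the numbers, per the above.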
Here is AMD Ryzen 9 5900x on Windows 11<p><a href="https://gist.github.com/smarkwell/d72deee656341d53dff469df2bcc6547" rel="nofollow">https://gist.github.com/smarkwell/d72deee656341d53dff469df2b...</a>
I've been doing some latency measurements like this, but between two processes using unix domain sockets. I'm measuring more on the order of 50µs on average when using FIFO RT scheduling. I suspect the kernel is either letting processes linger for a little bit, or perhaps the "idle" threads tend to call into the kernel and let it do some non-preemptible bookkeeping.<p>If I crank up the amount of traffic going through the sockets, the average latency drops, presumably because the processes can batch together multiple packets rather than having to block on each one.
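For concreteness, a rough sketch of this kind of measurement: one byte bounced over a unix domain socketpair between a parent and a forked child, optionally under SCHED_FIFO. The priority value and iteration count are arbitrary choices, and the sched_setscheduler call silently does nothing without the right privileges.
<pre><code>#include <stdio.h>
#include <sched.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERS 100000

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { perror("socketpair"); return 1; }

    /* Best effort: FIFO RT scheduling; inherited by the child across fork(). */
    struct sched_param sp = { .sched_priority = 10 };
    sched_setscheduler(0, SCHED_FIFO, &sp);

    if (fork() == 0) {                        /* child: echo server */
        char c;
        for (int i = 0; i < ITERS; i++) {
            if (read(sv[1], &c, 1) != 1) break;
            write(sv[1], &c, 1);
        }
        _exit(0);
    }

    char c = 'x';
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERS; i++) {         /* parent: ping, wait for echo */
        write(sv[0], &c, 1);
        read(sv[0], &c, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    wait(NULL);

    double us = ((end.tv_sec - start.tv_sec) * 1e9 +
                 (end.tv_nsec - start.tv_nsec)) / 1e3;
    printf("avg round trip: %.2f us\n", us / ITERS);
    return 0;
}
</code></pre>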
This is a fascinating insight into a subsystem which we take for granted and naively assume is homogeneous. Thank you so much for sharing.<p>A request to the community - I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So it would be very interesting to see what happens with it - both on macOS and (say, Asahi) Linux.
This benchmark reminds me of "ffwd: delegation is (much) faster than you think" <a href="https://www.seltzer.com/margo/teaching/CS508-generic/papers-a1/roghanchi17.pdf" rel="nofollow">https://www.seltzer.com/margo/teaching/CS508-generic/papers-...</a>.<p>This paper describes a mechanism for client threads pinned to distinct cores to delegate a function call to a distinguished server thread pinned to its own core, all on the same socket.<p>This has a multitude of applications, the most obvious being making a shared data structure MT-safe through delegation rather than saddling it with mutexes or other synchronization points, which is especially beneficial with small critical sections.<p>The paper's abstract concludes by claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."<p>The code does delegation without CAS, locks, or atomics.<p>The efficacy of such a scheme rests on two facets, which the paper explains:<p>* Modern CPUs can move GBs/second between core L2/LLC caches<p>* The synchronization between requesting clients and the responding server depends on each side spinning on a shared memory address looking for bit toggles. Briefly, the server only reads client request memory, which only the clients write (clients each have their own slot). And on the response side, clients only read the server's shared response memory, which only the server writes. This one-side-read, one-side-write arrangement (sketched below) is supposed to minimize the number of cache invalidations and MESI syncs.<p>I spent some time testing the authors' code and went so far as writing my own version. I was never able to get anywhere near the throughput claimed in the paper. There are also some funny "nop" assembler instructions within the code that I gather are a cheap form of thread yielding.<p>In fact this relatively simple SPSC MT ring buffer, which has but a fraction of the code:<p><a href="https://rigtorp.se/ringbuffer/" rel="nofollow">https://rigtorp.se/ringbuffer/</a><p>did far, far better.<p>In my experiments the CPUs spun too quickly, so core-to-core bandwidth was squandered before the server could signal a response or the client could signal a request. I wonder if adding selective atomic reads, as with the SPSC ring, might help.
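To make the signaling scheme concrete, here is a stripped-down, single-client sketch of the one-side-read / one-side-write idea: the request flag is written only by the client and read only by the server, and vice versa for the response flag, each on its own cache line. The padding, the sequence-number protocol, and the trivial delegated operation are my own simplifications; the paper's actual code packs multiple client slots and avoids even C11 atomics by relying on x86 store ordering.
<pre><code>#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CACHELINE 64
#define REQUESTS 1000000

struct slot {
    _Alignas(CACHELINE) _Atomic unsigned req_seq;  /* written only by the client */
    long arg;
    _Alignas(CACHELINE) _Atomic unsigned resp_seq; /* written only by the server */
    long result;
};

static struct slot slot;
static long server_private_sum;   /* the data structure only the server touches */

static void *server(void *unused) {
    unsigned seen = 0;
    for (int i = 0; i < REQUESTS; i++) {
        /* server only reads req_seq; it never writes it */
        while (atomic_load_explicit(&slot.req_seq, memory_order_acquire) == seen) {}
        seen++;
        server_private_sum += slot.arg;            /* the "delegated" critical section */
        slot.result = server_private_sum;
        atomic_store_explicit(&slot.resp_seq, seen, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, server, NULL);

    unsigned seq = 0;
    for (int i = 0; i < REQUESTS; i++) {
        slot.arg = i;
        atomic_store_explicit(&slot.req_seq, ++seq, memory_order_release);
        /* client only reads resp_seq; it never writes it */
        while (atomic_load_explicit(&slot.resp_seq, memory_order_acquire) != seq) {}
    }
    pthread_join(t, NULL);
    printf("final sum: %ld\n", slot.result);
    return 0;
}
</code></pre>
Even in this toy form, both sides spin flat out, which is exactly the bandwidth-squandering behaviour described above; backoff or yielding in the spin loops is where the interesting tuning lives.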