TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.


Intel i7 loop performance anomaly (2013)

83 points by prando over 7 years ago

7 comments

pbsd over 7 years ago

The call shifts the loop from being limited at the backend level (mostly uops not being retired because they have to wait for memory) to being limited at the frontend level. The key event to look for here is `IDQ_UOPS_NOT_DELIVERED.CORE`, which tells us whether the frontend is delivering the full 4 uops per cycle it is capable of. In the tight loop this is almost always the case, whereas in the call loop it rarely is.

CALL, RET, and JNE all share the same execution port (port 6 on Skylake), so it seems plausible that the added pressure on this port prevents speculative execution from continuing through the loop at the same rate as in the tight loop. If you look at the execution port breakdown of each loop, port 6 dominates in the call loop, whereas the tight loop is bottlenecked on port 4 (the port where stores go).

By delivering fewer uops per cycle, the pressure on the backend is eased. But this is a delicate balance: if you add another call, the loop becomes much slower than the tight loop.

You can get a similar effect by replacing `__asm__("call foo")` with

    __asm__("jmp 1f\n1:\n");
    __asm__("jmp 1f\n1:\n");

which consumes the same amount of port 6.
bufferoverflow over 7 years ago

99.9% of the time these anomalies are due either to cache/branching or to alignment.
zwerdlds over 7 years ago

Article from 2013.
nimos over 7 years ago

I wonder what happens if you replace the call with one or two nops?
inetknght over 7 years ago

Disclaimer: I am not an expert and have not measured. This is armchair theory. But I would argue two things.

First, the former appears to have at least one unaligned arithmetic:

> 400538: mov 0x200b01(%rip),%rdx # 601040 <counter>

...while the latter's equivalent instruction is 4-byte aligned:

> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>

So I would argue that's the biggest source of _speedup_ in the second case. However, I'm really interested in whether that's true, since I don't see a memory fence; the memory should be in L0 cache for both cases, and I have trouble believing that an unaligned access can be so much slower with the data in cache.

As for the `callq` to `repz retq`, I would venture a guess that the CPU is able to identify that there are no data dependencies there and the data is never even stored; I'd argue that it probably never even gets executed, because the instruction should fit in the instruction cache and branch prediction cache and all. Arguably. Like I said, I'm not an expert.

I'd say run it through Intel's code analyzer tool:

https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

Tangential video worth watching:

https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=2466

Edit: actually, thinking about it, it's not unaligned _access_, it's unaligned _math_. I don't think that should affect performance at all? Fun.
dmix over 7 years ago

Was the exact reason ever figured out?
maxk42 over 7 years ago

Educated guess: the processor is trying to prefetch instructions. This loop is much tighter than most code that would typically be written, so the incrementing loop causes a branch misprediction. The processor is still loading instructions, so when it goes to find out what to do next it takes a cache miss and burns some time figuring out its next instruction. A function call, however, is very slow (even to a nop function), and it could delay the processor just long enough for the prefetch to complete.