TechEcho

The Alder Lake SHLX Anomaly

234 points by panic 5 months ago

13 comments

tavianator 5 months ago
One fun thing about this is that spilling+restoring the register will fix it, so if any kind of context switch happens (thread switch, page fault, interrupt, etc.), the register will get pushed to the stack and popped back from it, and the code suddenly gets 3x faster. Makes it a bit tricky to reproduce reliably, and led me down a few dead ends as I was writing this up.
userbinator 5 months ago
How about trying "xor ecx, ecx; inc ecx"? Or the even shorter "mov cl, 1"?

> It is very strange to me that the instruction used to set the shift count register can make the SHLX instruction 3× slower.

I suspect this is a width restriction in the bypass/forwarding network.

> The 32-bit vs. 64-bit operand size distinction is especially surprising to me as SHLX only looks at the bottom 6 bits of the shift count.

Unfortunately the dependency analysis circuitry seems not Intel-ligent enough to make that distinction.
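The quoted point about the shift count can be illustrated with a small portable model. This is only a sketch: the function names below are hypothetical, and the code models the architectural result of the instructions, not their timing on Alder Lake.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of 64-bit SHLX: only the low 6 bits of the count are used. */
static uint64_t shlx64_model(uint64_t value, uint64_t count) {
    return value << (count & 63);
}

/* Model of MOV ECX, imm32: a 32-bit register write zero-extends to 64 bits. */
static uint64_t mov_r32_imm32(uint32_t imm) {
    return (uint64_t)imm;
}

/* Model of MOV RCX, imm32: the immediate is sign-extended to 64 bits. */
static uint64_t mov_r64_simm32(int32_t imm) {
    return (uint64_t)(int64_t)imm;
}

/* Both ways of loading the count 1 give the same shift result. */
static void check_count_forms_agree(uint64_t value) {
    assert(shlx64_model(value, mov_r32_imm32(1)) ==
           shlx64_model(value, mov_r64_simm32(1)));
}
```

Calling `check_count_forms_agree(5)` passes because both forms leave the same value in the count register, which is why the slowdown discussed in the thread is a performance difference only, not a difference in results.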
crest 4 months ago
So the issue is that the frontend which "eliminates" those operations doesn't make the resulting values available to the backend if the value fits Intel's ad-hoc definition of a short immediate (which shift values do), and since there is no SHLX-with-immediate uop, the small value has to be expanded to a full 32 or 64 bits and requested from the frontend, which adds two cycles of latency?

Has anyone run tests with ≥3 interleaved SHLX dependency chains in a loop? Does it "just" have 3-cycle latency, or also less than 1 operation/cycle sustained throughput? Because if the pipeline stalls, that would be even more annoying for existing optimised code.

Is the regression limited to only Alder Lake P-cores, or is it also present in later (refreshed) cores?
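The interleaved-chain experiment asked about above could be structured roughly as follows. This is a sketch of the loop shape only: `run_shift_chains` and `MAX_CHAINS` are hypothetical names, no timing harness is included, and a C compiler is free to emit something other than SHLX or to simplify the loops, so a real measurement would need hand-written assembly or at least inspection of the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHAINS 8

/* Runs `chains` independent shift chains, each `steps` shifts long.
 * Within a chain every shift depends on the previous one (latency-bound);
 * different chains do not depend on each other, so with enough chains
 * the loop becomes throughput-bound instead.
 * Returns the sum of the final chain values. */
static uint64_t run_shift_chains(size_t chains, size_t steps, uint64_t count) {
    uint64_t acc[MAX_CHAINS];
    assert(chains <= MAX_CHAINS);

    for (size_t c = 0; c < chains; c++) {
        acc[c] = c + 1; /* distinct starting value per chain */
    }
    for (size_t s = 0; s < steps; s++) {
        for (size_t c = 0; c < chains; c++) {
            acc[c] = acc[c] << (count & 63); /* depends only on acc[c] */
        }
    }

    uint64_t sum = 0;
    for (size_t c = 0; c < chains; c++) {
        sum += acc[c];
    }
    return sum;
}
```

Timing the same total number of shifts with one chain and with three or more chains would separate the two cases in the question: if only latency is affected, the multi-chain version should approach one shift per cycle, while a pipeline stall would keep it slower.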
BeeOnRope 5 months ago
Worth noting that, whether intentional or not, this would be easy to miss and unlikely to move benchmark numbers, since compilers won't generate instructions like this: they would use the eax form, which is 1 byte shorter and functionally equivalent.

Even some assemblers will optimize this for you.
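The kind of size optimization mentioned above can be sketched as a small selection function. The names are hypothetical and this does not describe any particular assembler; it assumes a legacy register such as RCX (so the 32-bit form needs no REX prefix) and the usual two's-complement conversion from `uint64_t` to `int64_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three ways to load a constant into a 64-bit register such as RCX. */
enum mov_form {
    MOV_R32_IMM32,  /* mov ecx, imm32 : zero-extends into rcx   */
    MOV_R64_SIMM32, /* mov rcx, imm32 : immediate sign-extended */
    MOV_R64_IMM64   /* mov rcx, imm64 : full 64-bit immediate   */
};

/* Picks the shortest form that still produces the requested 64-bit value. */
static enum mov_form shortest_mov_form(uint64_t imm) {
    if (imm <= UINT32_MAX) {
        return MOV_R32_IMM32;
    }
    int64_t s = (int64_t)imm; /* assumes two's-complement conversion */
    if (s >= INT32_MIN && s <= INT32_MAX) {
        return MOV_R64_SIMM32;
    }
    return MOV_R64_IMM64;
}

/* Encoded length in bytes of each form for a legacy register. */
static size_t mov_form_length(enum mov_form form) {
    switch (form) {
    case MOV_R32_IMM32:  return 5;  /* opcode + imm32             */
    case MOV_R64_SIMM32: return 7;  /* REX.W + opcode + ModRM + imm32 */
    case MOV_R64_IMM64:  return 10; /* REX.W + opcode + imm64     */
    }
    assert(0 && "unknown mov form");
    return 0;
}
```

For the constant 1 used in the article, `shortest_mov_form(1)` selects the 32-bit form, which is why ordinary compiler output rarely contains the slow 64-bit variant.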
aftbit 5 months ago
Woah that's weird. Left shifting takes either 3 cycles or 1 cycle, depending on how you initialize the shift count register?

This patch from the article makes it take 1 cycle instead of 3:

    - MOV RCX, 1
    + MOV ECX, 1

> It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate).

Practically speaking, is this sort of µop-dependent optimization implemented by compilers? How do they do so?
bhouston 5 months ago
Shifts were not always fast. These old Hacker News comments contain the details: https://news.ycombinator.com/item?id=2962770
juancn 4 months ago
Code alignment? I mean, the different instructions change the alignment of the rest of the code.

- 64 bit register

    0: 48 c7 c1 01 00 00 00    mov rcx,0x1

- 32 bit register

    0: b9 01 00 00 00          mov ecx,0x1

It should be easy to test by adding a couple of NOPs to the fast version:

    0: b9 01 00 00 00          mov ecx,0x1
    5: 90                      nop
    6: 90                      nop

and see if it regresses again.

I don't have an Alder Lake to test on.
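The padding arithmetic in the comment above can be checked with a short sketch. The byte sequences are the ones from the disassembly in the comment; the helper names are hypothetical, and the code only builds the padded byte sequence, it does not execute or time it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encodings taken from the disassembly quoted in the comment. */
static const uint8_t mov_rcx_1[] = {0x48, 0xc7, 0xc1, 0x01, 0x00, 0x00, 0x00};
static const uint8_t mov_ecx_1[] = {0xb9, 0x01, 0x00, 0x00, 0x00};
static const uint8_t nop_byte = 0x90;

/* Number of 1-byte NOPs needed after the short form so that the code
 * following it starts at the same offset as after the long form. */
static size_t nops_to_match(size_t long_len, size_t short_len) {
    return long_len > short_len ? long_len - short_len : 0;
}

/* Copies `code` into `out` and pads it with NOPs up to `target_len` bytes.
 * Returns the number of bytes written. */
static size_t pad_with_nops(const uint8_t *code, size_t code_len,
                            size_t target_len, uint8_t *out, size_t out_cap) {
    assert(code_len <= target_len && target_len <= out_cap);
    memcpy(out, code, code_len);
    memset(out + code_len, nop_byte, target_len - code_len);
    return target_len;
}
```

With these encodings, `nops_to_match(sizeof mov_rcx_1, sizeof mov_ecx_1)` is 2, matching the two NOPs proposed in the comment, so the padded fast version occupies the same 7 bytes as the slow one.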
orlp 4 months ago
Seems like LLVM knows about this quirk (note how it suddenly uses eax instead of rax for the multiply): https://rust.godbolt.org/z/8jh7YPhz4
tavianator 4 months ago
Wrote up the presumed explanation here: https://tavianator.com/2025/shlxplained.html
BeeOnRope 5 months ago
To rule out alignment, you should add padding to one so that the two variations have the same alignment in their long run of SHLX (I don't actually think it's alignment-related, though).
rincebrain 4 months ago
Since this seems like an optimization going awry somewhere, I wonder if there's a chicken bit that disables it, and if so, how broad the impact of disabling it is...
eqvinox 4 months ago
Hmm… does this have any impact on constant-time cryptographic algorithm implementations? In particular the wider "addition in register renamer" story?
BeeOnRope 5 months ago
Does this also occur with other 3-argument instructions like ANDN?