One fun thing about this is that spilling+restoring the register will fix it, so if any kind of context switch happens (thread switch, page fault, interrupt, etc.), the register will get pushed to the stack and popped back from it, and the code suddenly gets 3x faster. Makes it a bit tricky to reproduce reliably, and led me down a few dead ends as I was writing this up.
How about trying "xor ecx, ecx; inc ecx"? Or the even shorter "mov cl, 1"?<p><i>It is very strange to me that the instruction used to set the shift count register can make the SHLX instruction 3× slower.</i><p>I suspect this is a width restriction in the bypass/forwarding network.<p><i>The 32-bit vs. 64-bit operand size distinction is especially surprising to me as SHLX only looks at the bottom 6 bits of the shift count.</i><p>Unfortunately the dependency analysis circuitry seems not Intel-ligent enough to make that distinction.
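For reference, the suggested alternatives as they would replace the original initialization (hypothetical variants, untested here):<p><pre><code> MOV RCX, 1    ; original: 64-bit operand size, slow path

 XOR ECX, ECX  ; zeroing idiom...
 INC ECX       ; ...then ECX = 1

 MOV CL, 1     ; 2 bytes, partial-register write
</code></pre>Whether the partial-register write or the zeroing idiom avoids the slow path would tell you something about where the width restriction lives.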
So the issue is that the frontend which "eliminates" those operations doesn't make the resulting values available to the backend if the value fits Intel's ad-hoc definition of a short immediate (which shift counts do), and since there is no SHLX-with-immediate uop, the small value has to be expanded to a full 32 or 64 bits and requested from the frontend, which adds two cycles of latency?<p>Has anyone run tests with ≥3 interleaved SHLX dependency chains in a loop? Does it "just" have 3 cycle latency or also less than 1 operation/cycle sustained throughput? Because if the pipeline stalls that would be even more annoying for existing optimised code.<p>Is the regression limited to only Alder Lake P-cores or also present in later (refreshed) cores?
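A minimal sketch of such a test (register choices and loop structure are illustrative, not from the article):<p><pre><code> MOV RCX, 1             ; shift count initialized the "slow" way
 loop:
 SHLX RAX, RAX, RCX     ; chain 1
 SHLX RBX, RBX, RCX     ; chain 2
 SHLX RDX, RDX, RCX     ; chain 3
 DEC RSI
 JNZ loop
</code></pre>If the unit is fully pipelined, three interleaved chains should sustain roughly one SHLX per cycle even at 3-cycle latency; anything meaningfully below that would indicate the throughput stall the parent comment is worried about.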
Worth noting that whether intentional or not, this would be easy to miss and unlikely to move benchmark numbers since compilers won't generate instructions like this: they would use the eax form which is 1 byte shorter and functionally equivalent.<p>Even some assemblers will optimize this for you.
Woah, that's weird. Left shifting takes either 3 cycles or 1 cycle, depending on how you initialize the shift count register?<p>This patch from the article makes it take 1 cycle instead of 3:<p><pre><code> - MOV RCX, 1
+ MOV ECX, 1
</code></pre>
>It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate).<p>Practically speaking, is this sort of µop-dependent optimization implemented by compilers? How do they do so?
Shifts were not always fast. These old hacker news comments contain the details: <a href="https://news.ycombinator.com/item?id=2962770">https://news.ycombinator.com/item?id=2962770</a>
Code alignment? I mean, the different instructions change the alignment of the rest of the code.<p>- 64 bit register<p><pre><code> 0: 48 c7 c1 01 00 00 00 mov rcx,0x1
- 32 bit register
0: b9 01 00 00 00 mov ecx,0x1
</code></pre>
It should be easy to test by adding a couple of NOPs to the fast version:<p><pre><code> 0: b9 01 00 00 00 mov ecx,0x1
5: 90 nop
6: 90 nop
</code></pre>
and see if it regresses again.<p>I don't have an Alder Lake to test on.
Seems like LLVM knows about this quirk (note how it suddenly uses eax instead of rax for the multiply): <a href="https://rust.godbolt.org/z/8jh7YPhz4" rel="nofollow">https://rust.godbolt.org/z/8jh7YPhz4</a>.
Wrote up the presumed explanation here: <a href="https://tavianator.com/2025/shlxplained.html" rel="nofollow">https://tavianator.com/2025/shlxplained.html</a>
To rule out alignment you should add padding to one version so the two variations have the same alignment in their long run of SHLX instructions (I don't actually think it's alignment related, though).
Since this seems like an optimization going awry somewhere, I wonder if there's a chicken bit that disables it, and if so, how broad the impact of disabling it is...
Hmm… does this have any impact on constant-time cryptographic algorithm implementations? In particular the wider "addition in the register renamer" story?