I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for <i>both</i> versions, instead of just the better version. Quote:<p>"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"<p>—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
I wish there were a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is that, while it might cost a tiny overhead, they are scared of introducing a branch.<p>I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost want branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done without branching.
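For what it's worth, here is a scalar C sketch of the shape that compilers typically turn into a conditional move rather than a branch. This is not guaranteed by any standard, but GCC and Clang usually do it at -O1 and above when both arms are side-effect-free values:

```c
/* Both arms are plain values with no side effects, so compilers
   typically emit a conditional move (cmov / fcsel) here instead of
   a branch. A sketch, not a guarantee. */
float select_f(float x, float y, float a, float b) {
    return x > y ? a : b;
}
```

The pattern tends to stay branchless as long as the arms are cheap and already computed; hoist a function call or a load behind one of the arms and the compiler may reintroduce a branch.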
A lot of the myth that "branches are slow on GPUs" exists because, way back on the PlayStation 3, they were quite slow. The PS3 used NVIDIA's RSX GPU; its branch cost was documented as six cycles IIRC, but it always measured slower than that to me.
That was the cost even for a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower still, because the IFEH instruction took six cycles and the GPU would have to execute both sides of the branch.
I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap, especially for coherent branches.
These sorts of avoid-branches optimizations <i>were</i> effective once upon a time, as I profiled them on the Xbox 360 and some ancient Intel iGPUs, but yeah: don't do this anymore.<p>Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
This article is also relevant: <a href="https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bfc83694f2" rel="nofollow">https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bf...</a><p>“If you consult the internet about writing a branch on a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.<p>Let’s correct that.”
Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.<p>As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
Some of the mistakes/confusion being pointed out in the article are being replicated here, it seems.<p>The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.<p>The article is pointing out that <i>conditional logic</i> in the form presented does not get compiled into <i>conditionally branching</i> code.
And that people should not continue to propagate the harmful advice to cover up every conditional thing <i>in sight</i>[0].<p>Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.<p>[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.<p>[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
So why isn't the compiler smart enough to see that the 'optimised' version is the same?<p>Surely it understands "step()" and can optimize the "step()==0.0" and "step()==1.0" cases separately?<p>This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
Thanks Inigo !<p><pre><code> The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:
float step( float x, float y )
{
return x < y ? 1.0 : 0.0;
}
</code></pre>
How are we supposed to know which OpenGL functions are emulated rather than calling GPU primitives?
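For reference, the expansion in question can be sketched in C, with step() and mix() written out per their GLSL definitions (the underscore-suffixed names are local stand-ins, not the real built-ins):

```c
/* step() and mix() written out per their GLSL spec definitions */
static float step_(float edge, float x) { return x < edge ? 0.0f : 1.0f; }
static float mix_(float a, float b, float t) { return a * (1.0f - t) + b * t; }

/* the "optimized" form: a compare plus two multiplies and two adds */
float via_step(float b, float c, float x, float y) {
    return mix_(b, c, step_(y, x));
}

/* the direct form: a compare plus a select */
float via_select(float b, float c, float x, float y) {
    return x >= y ? c : b;
}
```

Both compute the same value, so the mix/step version only wins if the driver pattern-matches it back into a select, which is exactly what can't be relied upon.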
I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured a performance drop doing this. Sometimes significant.
Unrelated, but somehow similar: I really hate that it's not possible to force gcc to transform things like this into a conditional move:<p>x > c ? y : 0.;<p>It has annoyed me many times and it still does.
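One workaround that often (though not always; it's still up to the optimizer) coaxes GCC into branch-free code for this shape is to multiply by the comparison result, which in C evaluates to an int 0 or 1:

```c
/* (x > c) is 0 or 1, so the multiply selects y or 0.0 without a
   ternary. Caveat: this differs from the ternary when y is
   infinite or NaN, since 0 * NaN is NaN rather than 0. */
double select_or_zero(double x, double c, double y) {
    return (x > c) * y;
}
```

It's a trade of readability for codegen, and the NaN/infinity caveat means it isn't a drop-in replacement in all cases.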
I think the core problem is that when writing code like this you need experience to be sure that it won’t have a conditional branch. How many operations past the conditional cause a branch? Which operations can the compiler elide to bring the total below this count? I’m all for writing direct code and relying on smart compilers, but it’s often hard to know if and where I’m going to get bitten. Do I always have to inspect the assembly? Do I need a performance testing suite to check for accidental regressions? I find it much easier if I can give the compiler a hint on what I expect it to do; this would be similar to a @tailcall annotation. That way I can explore the design space without worrying that I’ll accidentally overstep some hard-to-reason-about boundary that will tank the performance.
It's funny because I've rarely seen this (wrong) approach done anywhere else, but I picked it up by myself (as a lot of people did, I presume) and I'm still the first to do it every time I see the occasion, not for optimization (though I admit I thought it wouldn't hurt) but for the flow and natural look of it. It somehow feels more right to me to compose effects by interpolating signals rather than by explicit ternary branch instructions.<p>Now I'll have to change my ways in fear of being rejected socially for this newly approved bad practice.<p>At least in WebGPU's WGSL we have the `select` instruction that does that ternary operation hidden as a function, so there is that.
I believe the conclusion is correct in 2025, but the article in a way just perpetuates the 'misinformation' by making it seem easier than it is to find out whether your code will compile to a dynamic branch.<p>The unfortunate truth with shaders is that they are compiled on the user's machine at the point of use, so compiling on just your machine isn't nearly good enough. NVIDIA's pricing means large numbers of customers are running 10-year-old hardware. Depending on the target market you might even want the code to run on 10-year-old integrated graphics.<p>Does 10-year-old integrated graphics, across the range of drivers people actually have running, prefer conditional moves over more arithmetic ops? Probably, but I would want to keep both versions around and test on real user hardware if this shader were used a lot.
<p><pre><code> So, if you ever see somebody proposing this
float a = mix( b, c, step( y, x ) );
</code></pre>
The author seems unaware of<p><pre><code> float a = mix( b, c, y > x );
</code></pre>
which encodes the desired behavior and also works for vectors:<p><pre><code> The variants of mix where a is genBType select which vector each returned component comes from. For a component of a that is false, the corresponding component of x is returned. For a component of a that is true, the corresponding component of y is returned.</code></pre>
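A scalar C sketch of that per-component behavior (a hypothetical two-component struct, just to illustrate what the genBType variants of mix do; not the GLSL built-in itself):

```c
typedef struct { float x, y; } vec2;

/* Per-component boolean select, as the genBType mix variants behave:
   where the selector is true, the component of c is returned,
   otherwise the component of b. */
vec2 mix_select(vec2 b, vec2 c, int sel_x, int sel_y) {
    vec2 r;
    r.x = sel_x ? c.x : b.x;
    r.y = sel_y ? c.y : b.y;
    return r;
}
```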
I don’t know enough about these implementations to know if this can be interpreted as a blanket ‘conditionals are fine’ or, rather, ‘ternary operations which select between two themselves non-branching expressions are fine’.<p>Like does this apply if one of the two branches of a conditional is computationally much more expensive? My (very shallow) understanding was that having, eg, a return statement on one branch and a bunch of work on the other would hamstring the GPU’s ability to optimize execution.
I love seeing the codegen output; it makes it easy to understand the issue. But claiming that it's faster or slower without actual benchmarks is a bit disappointing.
"of course real branches happen in GPU code"<p>My understanding was that they don't. All executions inside a "branch" always get executed, they're simply predicated to do nothing if the condition to enter is not true.
I've been doing this from day 1 because I just assumed you are not supposed to have loops or if/else blocks in your shader code. Now I know better, thanks iq ur g.
> For the record, of course real branches do happen in GPU code<p>Well, for some definition of "real". There are hardware features (on some architectures) that implement semantics that evaluate the same way that "branched" scalar code would. There is no branching at the instruction level, and can't be on SIMD (because the other parallel shaders being evaluated by the same instructions might not have taken the same branch!)
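A toy scalar emulation of that, with an explicit per-lane mask (purely illustrative; real hardware does this with execution masks over wide SIMD units, not code like this):

```c
/* Toy model of SIMT predication over 4 lanes: both sides of the
   "branch" are evaluated for every lane, then a per-lane predicate
   selects which result each lane keeps. No instruction-level branch. */
void predicated_select(const float x[4], float out[4]) {
    for (int i = 0; i < 4; i++) {
        float then_val = x[i] * 2.0f;   /* "if" side, always evaluated  */
        float else_val = x[i] + 1.0f;   /* "else" side, always evaluated */
        int mask = x[i] > 0.0f;         /* per-lane predicate            */
        out[i] = mask ? then_val : else_val;
    }
}
```

This is also why divergent branches cost roughly the sum of both sides: every lane pays for both computations and the mask merely discards one.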
It is weird how long misinformation like this sticks around, the conditional move/select approach has been superior for decades on both CPU & GPU, but somehow some people still write the other approach as an "optimization".
I'm not going to second guess IQ, who is one of the greatest modern authorities on shaders, but I do have some counterarguments.<p>- Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so it's a wash<p>- Most importantly, <i>if</i> statements look nasty on the screen. Having an if statement means a discontinuity in visuals, which means jagged and ugly pixels on the output. Of course having a <i>step</i> function doesn't change this, but that means the code is already in the correct form to replace it with <i>smoothstep</i>, which means you can interpolate between the two variations, which does look good.
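For reference, smoothstep per its GLSL definition is just a clamped cubic, so the antialiased select costs only a handful of arithmetic ops (a local reimplementation for illustration, not the built-in):

```c
/* smoothstep per its GLSL spec definition: clamp the normalized
   position to [0,1], then apply the Hermite cubic 3t^2 - 2t^3,
   giving a smooth 0..1 ramp across [edge0, edge1] instead of a
   hard cut. */
static float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

float smoothstep_(float edge0, float edge1, float x) {
    float t = clampf((x - edge0) / (edge1 - edge0), 0.0f, 1.0f);
    return t * t * (3.0f - 2.0f * t);
}
```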