I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for <i>both</i> versions, instead of just the better version. Quote:<p>"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"<p>—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
I wish there were a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is that, while it might cost a tiny overhead, they are scared of introducing a branch.<p>I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost want branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done without branching.
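For what it's worth, here is a scalar C sketch of the shape that compilers typically turn into a conditional move rather than a branch. This is not guaranteed by any standard, but GCC and Clang usually do it at -O1 and above when both arms are side-effect-free values:

```c
/* Both arms are plain values with no side effects, so compilers
   typically emit a conditional move (cmov / fcsel) here instead of
   a branch. A sketch, not a guarantee. */
float select_f(float x, float y, float a, float b) {
    return x > y ? a : b;
}
```

The pattern tends to stay branchless as long as the arms are cheap and already computed; hoist a function call or a load behind one of the arms and the compiler may reintroduce a branch.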
A lot of the myth that "branches are slow on GPUs" exists because, way back on the PlayStation 3, they were quite slow. The PS3 used NVIDIA's RSX GPU; its branch cost was documented as six cycles IIRC, but it always measured slower than that to me.
That was the cost even for a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower still, because the IFEH instruction took six cycles and the GPU would have to execute both sides of the branch.
I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap, especially for coherent branches.
These sorts of avoid-branches optimizations <i>were</i> effective once upon a time, as I profiled them on the Xbox 360 and some ancient Intel iGPUs, but yeah: don't do this anymore.<p>Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
This article is also relevant: <a href="https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bfc83694f2" rel="nofollow">https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bf...</a><p>“If you consult the internet about writing a branch on a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.<p>Let’s correct that.”
Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.<p>As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
Some of the mistakes/confusion being pointed out in the article are being replicated here, it seems.<p>The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.<p>The article is pointing out that <i>conditional logic</i> in the form presented does not get compiled into <i>conditionally branching</i> code.
And that people should not continue to propagate the harmful advice to cover up every conditional thing <i>in sight</i>[0].<p>Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.<p>[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.<p>[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
So why isn't the compiler smart enough to see that the 'optimised' version is the same?<p>Surely it understands "step()" and can optimize the "step()==0.0" and "step()==1.0" cases separately?<p>This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
Thanks Inigo !<p><pre><code> The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:
float step( float x, float y )
{
return x < y ? 1.0 : 0.0;
}
</code></pre>
How are we supposed to know which OpenGL functions are emulated rather than calling GPU primitives?
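For reference, the expansion in question can be sketched in C, with step() and mix() written out per their GLSL definitions (the underscore-suffixed names are local stand-ins, not the real built-ins):

```c
/* step() and mix() written out per their GLSL spec definitions */
static float step_(float edge, float x) { return x < edge ? 0.0f : 1.0f; }
static float mix_(float a, float b, float t) { return a * (1.0f - t) + b * t; }

/* the "optimized" form: a compare plus two multiplies and two adds */
float via_step(float b, float c, float x, float y) {
    return mix_(b, c, step_(y, x));
}

/* the direct form: a compare plus a select */
float via_select(float b, float c, float x, float y) {
    return x >= y ? c : b;
}
```

Both compute the same value, so the mix/step version only wins if the driver pattern-matches it back into a select, which is exactly what can't be relied upon.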
I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured a performance drop doing this. Sometimes significant.
Unrelated, but somehow similar: I really hate that it's not possible to force gcc to transform things like this into a conditional move:<p>x > c ? y : 0.;<p>It has annoyed me many times and it still does.
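One workaround that often (though not always; it's still up to the optimizer) coaxes GCC into branch-free code for this shape is to multiply by the comparison result, which in C evaluates to an int 0 or 1:

```c
/* (x > c) is 0 or 1, so the multiply selects y or 0.0 without a
   ternary. Caveat: this differs from the ternary when y is
   infinite or NaN, since 0 * NaN is NaN rather than 0. */
double select_or_zero(double x, double c, double y) {
    return (x > c) * y;
}
```

It's a trade of readability for codegen, and the NaN/infinity caveat means it isn't a drop-in replacement in all cases.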
I think the core problem is that when writing code like this you need experience to be sure that it won’t have a conditional branch. How many operations past the conditional cause a branch? Which operations can the compiler elide to bring the total below this count? I’m all for writing direct code and relying on smart compilers, but it’s often hard to know if and where I’m going to get bitten. Do I always have to inspect the assembly? Do I need a performance testing suite to check for accidental regressions? I find it much easier if I can give the compiler a hint on what I expect it to do; this would be similar to a @tailcall annotation. That way I can explore the design space without worrying that I’ll accidentally overstep some hard-to-reason-about boundary that will tank the performance.
It's funny because I've rarely seen this (wrong) approach done anywhere else, but I picked it up by myself (as a lot of people did, I presume) and I'm still the first to do it every time I see the occasion, not for optimization (though I admit I thought it wouldn't hurt) but for the flow and natural look of it. It somehow feels more right to me to compose effects by interpolating signals rather than by explicit ternary branch instructions.<p>Now I'll have to change my ways in fear of being rejected socially for this newly approved bad practice.<p>At least in WebGPU's WGSL we have the `select` instruction that does that ternary operation hidden as a function, so there is that.
I believe the conclusion is correct in 2025, but the article in a way just perpetuates the 'misinformation' by making it seem easier than it is to find out whether your code will compile to a dynamic branch.<p>The unfortunate truth with shaders is that they are compiled on the user's machine at the point of use, so compiling on just your machine isn't nearly good enough. NVIDIA's pricing means large numbers of customers are running 10-year-old hardware. Depending on the target market you might even want the code to run on 10-year-old integrated graphics.<p>Does 10-year-old integrated graphics, across the range of drivers people actually have running, prefer conditional moves over more arithmetic ops? Probably, but I would want to keep both versions around and test on real user hardware if this shader were used a lot.
<p><pre><code> So, if you ever see somebody proposing this
float a = mix( b, c, step( y, x ) );
</code></pre>
The author seems unaware of<p><pre><code> float a = mix( b, c, y > x );
</code></pre>
which encodes the desired behavior and also works for vectors:<p><pre><code> The variants of mix where a is genBType select which vector each returned component comes from. For a component of a that is false, the corresponding component of x is returned. For a component of a that is true, the corresponding component of y is returned.</code></pre>
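A scalar C sketch of that per-component behavior (a hypothetical two-component struct, just to illustrate what the genBType variants of mix do; not the GLSL built-in itself):

```c
typedef struct { float x, y; } vec2;

/* Per-component boolean select, as the genBType mix variants behave:
   where the selector is true, the component of c is returned,
   otherwise the component of b. */
vec2 mix_select(vec2 b, vec2 c, int sel_x, int sel_y) {
    vec2 r;
    r.x = sel_x ? c.x : b.x;
    r.y = sel_y ? c.y : b.y;
    return r;
}
```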
I don’t know enough about these implementations to know if this can be interpreted as a blanket ‘conditionals are fine’ or, rather, ‘ternary operations which select between two themselves non-branching expressions are fine’.<p>Like does this apply if one of the two branches of a conditional is computationally much more expensive? My (very shallow) understanding was that having, eg, a return statement on one branch and a bunch of work on the other would hamstring the GPU’s ability to optimize execution.
I love seeing the codegen output; it makes it easy to understand the issue. But claiming that it's faster or slower without actual benchmarks is a bit disappointing.
"of course real branches happen in GPU code"<p>My understanding was that they don't. All executions inside a "branch" always get executed, they're simply predicated to do nothing if the condition to enter is not true.
I've been doing this from day 1 because I just assumed you are not supposed to have loops or if/else blocks in your shader code. Now I know better, thanks iq ur g.
> For the record, of course real branches do happen in GPU code<p>Well, for some definition of "real". There are hardware features (on some architectures) that implement semantics that evaluate the same way that "branched" scalar code would. There is no branching at the instruction level, and can't be on SIMD (because the other parallel shaders being evaluated by the same instructions might not have taken the same branch!)
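A toy scalar emulation of that, with an explicit per-lane mask (purely illustrative; real hardware does this with execution masks over wide SIMD units, not code like this):

```c
/* Toy model of SIMT predication over 4 lanes: both sides of the
   "branch" are evaluated for every lane, then a per-lane predicate
   selects which result each lane keeps. No instruction-level branch. */
void predicated_select(const float x[4], float out[4]) {
    for (int i = 0; i < 4; i++) {
        float then_val = x[i] * 2.0f;   /* "if" side, always evaluated  */
        float else_val = x[i] + 1.0f;   /* "else" side, always evaluated */
        int mask = x[i] > 0.0f;         /* per-lane predicate            */
        out[i] = mask ? then_val : else_val;
    }
}
```

This is also why divergent branches cost roughly the sum of both sides: every lane pays for both computations and the mask merely discards one.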
It is weird how long misinformation like this sticks around, the conditional move/select approach has been superior for decades on both CPU & GPU, but somehow some people still write the other approach as an "optimization".
I'm not going to second guess IQ, who is one of the greatest modern authorities on shaders, but I do have some counterarguments.<p>- Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so it's a wash<p>- Most importantly, <i>if</i> statements look nasty on the screen. Having an if statement means a discontinuity in visuals, which means jagged and ugly pixels on the output. Of course having a <i>step</i> function doesn't change this, but that means the code is already in the correct form to replace it with <i>smoothstep</i>, which means you can interpolate between the two variations, which does look good.
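For reference, smoothstep per its GLSL definition is just a clamped cubic, so the antialiased select costs only a handful of arithmetic ops (a local reimplementation for illustration, not the built-in):

```c
/* smoothstep per its GLSL spec definition: clamp the normalized
   position to [0,1], then apply the Hermite cubic 3t^2 - 2t^3,
   giving a smooth 0..1 ramp across [edge0, edge1] instead of a
   hard cut. */
static float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

float smoothstep_(float edge0, float edge1, float x) {
    float t = clampf((x - edge0) / (edge1 - edge0), 0.0f, 1.0f);
    return t * t * (3.0f - 2.0f * t);
}
```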