I think this is a case of mis-assigned blame (on the tool's part, not the author's). My semi-educated guesswork follows:

Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn't appear to have decreased by very much. "Time per op" has halved in max_lol(), but the total number of ops being performed has likely increased too - specifically, extra work was done "for free" (as was shown in min_max).

This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!

In this experiment, maxV is potentially written to on each iteration of the loop. Valid execution of the next iteration requires waiting on that updated value of maxV. This compare-and-write takes longer than just running an instruction - it's a stall point.

In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there's nothing else it can be doing at the time.

In the other profiles, we see a more "honest" picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON'T rely on knowing maxV - printing "lol", or doing the other compare and write for minV.

I propose that the processor is doing our added work on every iteration of the loop, no matter what (and why not? Nothing else would be happening in that time). That BLT instruction isn't making things faster; it's just deciding whether to keep the results of our extra work or throw them away.

Throughput is important, but it's not always the same thing as speed. That's good to keep in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame things like cache misses or other stalls on a (relatively) innocent compare instruction!
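To make the dependency-chain point concrete, here's a generic sketch in Go (my own reconstruction of the loop shapes being discussed, not code copied from the article). The single-accumulator loop serializes every iteration on the write to maxV, while the combined min/max loop hands the CPU a second, independent compare-and-write chain to overlap with the stall:

    // Generic sketch, not the article's code: two loop shapes with
    // different dependency structure (assumes non-empty slices).
    func maxOnly(xs []int64) int64 {
        maxV := xs[0]
        for _, v := range xs[1:] {
            if v > maxV {
                maxV = v // the next iteration's compare waits on this write
            }
        }
        return maxV
    }

    func minMax(xs []int64) (int64, int64) {
        minV, maxV := xs[0], xs[0]
        for _, v := range xs[1:] {
            if v < minV {
                minV = v // independent of maxV, so it can run while the maxV chain stalls
            }
            if v > maxV {
                maxV = v
            }
        }
        return minV, maxV
    }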
I read it and I still don't get it. Can someone (re-)explain what the presence of the print() is doing that helps branch prediction (or any other aspect of the CPU)?

Update: it seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325
Processor "optimizations" can produce surprising effects. The problem is these optimizations are not programmatically accessible to C (or most modern programming languages) given their simple memory model. <i>Deterministic</i> performance is not easy to obtain. My view is to not bother with such tricks unless absolutely necessary (and be prepared that your changes may actually <i>pessimize</i> performance on a future processor or a compatible processor by a different vendor).<p>If you are interested in this sort of thing, check out comp.arch!
In the Linux kernel, there are unlikely() and likely() macros which use __builtin_expect to tell the compiler whether a condition is likely; that in turn influences the generated assembly so that the branch predictor does the right thing more of the time.

Unfortunately, the issue here is that the performance depends on the input, so such hints wouldn't help (unless you knew a priori that you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?
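One way to check that last question empirically (a hypothetical benchmark sketch of my own - none of these names come from the article) is to feed the same function ascending, descending, and random inputs, since they exercise the branch in very different patterns:

    package maxbench

    import (
        "math/rand"
        "testing"
    )

    // maxSlice stands in for whichever variant you want to measure.
    func maxSlice(xs []int64) int64 {
        maxV := xs[0]
        for _, v := range xs[1:] {
            if v > maxV {
                maxV = v
            }
        }
        return maxV
    }

    func makeInput(kind string, n int) []int64 {
        xs := make([]int64, n)
        for i := range xs {
            switch kind {
            case "asc":
                xs[i] = int64(i) // every element is a new maximum: branch taken each time
            case "desc":
                xs[i] = int64(n - i) // first element stays the maximum: branch never taken
            default:
                xs[i] = rand.Int63() // unpredictable early on, then mostly not taken
            }
        }
        return xs
    }

    func BenchmarkMax(b *testing.B) {
        for _, kind := range []string{"asc", "desc", "rand"} {
            xs := makeInput(kind, 1<<16)
            b.Run(kind, func(b *testing.B) {
                for i := 0; i < b.N; i++ {
                    _ = maxSlice(xs)
                }
            })
        }
    }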
Why would an unconditional print have any effect on whether the branch predictor is invoked or not? The if statement is there in both cases, so branch prediction should kick in for both. I didn't find an explanation for this behaviour in the article.
I'm a noob. Looking at the disasm: https://godbolt.org/z/766aPTPc3
It turns a CMOVQLT into a JLT. Is the blog saying CMOVQLT doesn't get branch prediction?
I don't get it.
I was curious what this strange assembly language was, as it looked like neither Arm nor x86.

Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm

I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences negate most of the usefulness of the abstraction.
The explanation is not convincing.

My guess is some kind of measurement error, or one of the "load-bearing nop" phenomena. By that I mean that the alignment of instructions (especially branch targets?) can dramatically affect performance, and compilers apparently have rather simplistic models for this, or don't consider it at all.
Does Go have any facility for providing hints to the optimiser (like how some C compilers support #pragmas) that could cause the branch-predicted instruction to be used rather than the slower one?
Go can have unexpected performance differences way higher up in the stack.

Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a data-transforming loop that took ~30 minutes or so. The solution could have been to call Gosched, but optimizing the loop down to a few seconds turned out to be easier.

I assume the inverse - the Go compiler adding too many Goscheds - can happen too. It's not that expensive - testing a condition - but if you do that a few million times, things add up.

[1]: https://pkg.go.dev/runtime#Gosched
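For context, runtime.Gosched() just yields the processor so other goroutines can run. A minimal sketch of the situation described (the loop body is my assumption, not the original code), which matters mainly on older runtimes before asynchronous preemption arrived in Go 1.14:

    package main

    import (
        "fmt"
        "runtime"
    )

    // transform stands in for a long, purely computational loop. Before Go 1.14,
    // a loop like this with no function calls contained no preemption points,
    // so it could hog its thread until it finished.
    func transform(data []int) {
        for i := range data {
            data[i] = data[i]*31 + 7 // hypothetical per-element work
            if i&1023 == 0 {
                runtime.Gosched() // explicit yield; cheap, but it adds up in hot loops
            }
        }
    }

    func main() {
        data := make([]int, 1<<20)
        transform(data)
        fmt.Println(data[0])
    }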
There's a Go course that was really good at this level of nuance. The author talks a lot about mechanical sympathy and how to dig into details like this. I think it's called Ultimate Go?
I tried to come up with the most efficient implementation of this rather simple function that I could think of in pure Go, without going down to SIMD assembly: https://go.dev/play/p/zHFxwvWOoeT

-32.31% geomean across the different tests looks rather great.
Any ideas how to make it even faster?
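The playground code isn't reproduced here, but one common direction for a function like this (a hypothetical sketch of mine, not necessarily what the link does) is to break the serial dependency on a single maximum by keeping several independent accumulators and merging them at the end:

    // Hypothetical sketch: four independent running maxima give the CPU four
    // dependency chains to work on in parallel; they're merged after the loop.
    // Assumes a non-empty slice.
    func max4(xs []int64) int64 {
        m0, m1, m2, m3 := xs[0], xs[0], xs[0], xs[0]
        i := 0
        for ; i+4 <= len(xs); i += 4 {
            if xs[i] > m0 {
                m0 = xs[i]
            }
            if xs[i+1] > m1 {
                m1 = xs[i+1]
            }
            if xs[i+2] > m2 {
                m2 = xs[i+2]
            }
            if xs[i+3] > m3 {
                m3 = xs[i+3]
            }
        }
        for ; i < len(xs); i++ { // leftover tail elements
            if xs[i] > m0 {
                m0 = xs[i]
            }
        }
        if m1 > m0 {
            m0 = m1
        }
        if m2 > m0 {
            m0 = m2
        }
        if m3 > m0 {
            m0 = m3
        }
        return m0
    }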
Most languages have a `max` function, so the core of the loop could be written with just something like `maxV = max(maxV, v)`.

That could be entirely branchless, right?
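For what it's worth, Go has had a built-in max for ordered types since Go 1.21, so the loop can be written exactly that way; whether the compiler then emits a conditional move or a branch is still its own decision, so branchless source doesn't guarantee branchless machine code. A minimal sketch assuming Go 1.21+:

    // Sketch assuming Go 1.21+, where max is a builtin (non-empty slice assumed).
    func maxSlice(xs []int64) int64 {
        maxV := xs[0]
        for _, v := range xs[1:] {
            maxV = max(maxV, v) // no if statement in the source
        }
        return maxV
    }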
Kind of tangential, but who are these people who are so comfortable disassembling a high-level language binary, reading assembly, and then making statements about branch prediction and other such low-level esoterica? I've only ever met people like that maybe two or three times in my career, and yet it seems like in certain language circles every other blog post I read is by some kind of ASM and reverse-engineering expert.