I think this is a case of mis-assigned blame (on the tool's part, not the author's). My semi-educated guesswork follows:

Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn't appear to have decreased by very much. "Time per op" has halved in max_lol(), but the total number of ops being performed has likely increased too - specifically, extra work was done "for free" (as was shown in min_max).

This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!

In this experiment, maxV is potentially written to on each iteration of the loop. Valid execution of the next iteration requires waiting on that updated value of maxV. This compare-and-write takes longer than just running an instruction - it's a stall point.

In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there's nothing else it can be doing at the time.

In the other profiles, we see a more "honest" picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON'T rely on knowing maxV - printing "lol", or doing the other compare and write for minV.

I propose that the processor is doing our added work on every iteration of the loop, no matter what (and why not? Nothing else would be happening in that time). That BLT instruction isn't making things faster; it's just deciding whether to keep the results of our extra work or throw them away.

Throughput is important, but it's not always the same thing as speed. That's good to keep in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame things like cache misses or other stalls on a (relatively) innocent compare instruction!
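To make the dependency-chain point concrete, here's a generic sketch in Go (my own reconstruction of the loop shapes being discussed, not code copied from the article). The single-accumulator loop serializes every iteration on the write to maxV, while the combined min/max loop hands the CPU a second, independent compare-and-write chain to overlap with the stall:

    // Generic sketch, not the article's code: two loop shapes with
    // different dependency structure (assumes non-empty slices).
    func maxOnly(xs []int64) int64 {
        maxV := xs[0]
        for _, v := range xs[1:] {
            if v > maxV {
                maxV = v // the next iteration's compare waits on this write
            }
        }
        return maxV
    }

    func minMax(xs []int64) (int64, int64) {
        minV, maxV := xs[0], xs[0]
        for _, v := range xs[1:] {
            if v < minV {
                minV = v // independent of maxV, so it can run while the maxV chain stalls
            }
            if v > maxV {
                maxV = v
            }
        }
        return minV, maxV
    }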
I read it and I still don't get it. Can someone (re-)explain what the presence of the print() is doing that helps branch prediction (or any other aspect of the CPU)?

Update: it seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325
Processor "optimizations" can produce surprising effects. The problem is these optimizations are not programmatically accessible to C (or most modern programming languages) given their simple memory model. <i>Deterministic</i> performance is not easy to obtain. My view is to not bother with such tricks unless absolutely necessary (and be prepared that your changes may actually <i>pessimize</i> performance on a future processor or a compatible processor by a different vendor).<p>If you are interested in this sort of thing, check out comp.arch!
In the Linux kernel, there are unlikely() and likely() macros which use __builtin_expect to tell the compiler whether a condition is likely; that in turn influences the generated assembly so that the branch predictor does the right thing more of the time.

Unfortunately, the issue here is that the performance depends on the input, so such hints wouldn't help (unless you knew a priori that you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?
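One way to check that last question empirically (a hypothetical benchmark sketch of my own - none of these names come from the article) is to feed the same function ascending, descending, and random inputs, since they exercise the branch in very different patterns:

    package maxbench

    import (
        "math/rand"
        "testing"
    )

    // maxSlice stands in for whichever variant you want to measure.
    func maxSlice(xs []int64) int64 {
        maxV := xs[0]
        for _, v := range xs[1:] {
            if v > maxV {
                maxV = v
            }
        }
        return maxV
    }

    func makeInput(kind string, n int) []int64 {
        xs := make([]int64, n)
        for i := range xs {
            switch kind {
            case "asc":
                xs[i] = int64(i) // every element is a new maximum: branch taken each time
            case "desc":
                xs[i] = int64(n - i) // first element stays the maximum: branch never taken
            default:
                xs[i] = rand.Int63() // unpredictable early on, then mostly not taken
            }
        }
        return xs
    }

    func BenchmarkMax(b *testing.B) {
        for _, kind := range []string{"asc", "desc", "rand"} {
            xs := makeInput(kind, 1<<16)
            b.Run(kind, func(b *testing.B) {
                for i := 0; i < b.N; i++ {
                    _ = maxSlice(xs)
                }
            })
        }
    }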
Why would an unconditional print have any effect on whether the branch predictor is invoked or not? The if statement is there in both cases, so branch prediction should kick in for both. I didn't find an explanation for this behaviour in the article.
I'm a noob. Looking at the disasm: https://godbolt.org/z/766aPTPc3
It turns a CMOVQLT into a JLT. Is the blog saying CMOVQLT doesn't get branch prediction?
I don't get it.
I was curious what this strange assembly language was, as it looked like neither Arm nor x86.

Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm

I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences negate most of the usefulness of the abstraction.
The explanation is not convincing.

My guess is some kind of measurement error, or one of the "load-bearing nop" phenomena. By that I mean that the alignment of instructions (especially branch targets?) can dramatically affect performance, and compilers apparently have rather simplistic models for this, or don't consider it at all.
Does Go have any facility for providing hints to the optimiser (like how some C compilers support #pragmas) that could cause the branch-predicted instruction to be used rather than the slower one?
Go can have unexpected performance differences way higher up in the stack.

Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a data-transforming loop that took ~30 minutes or so. The solution could have been to call Gosched, but optimizing the loop down to a few seconds turned out to be easier.

I assume the inverse - the Go compiler adding too many Goscheds - can happen too. It's not that expensive - testing a condition - but if you do that a few million times, things add up.

[1]: https://pkg.go.dev/runtime#Gosched
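For context, runtime.Gosched() just yields the processor so other goroutines can run. A minimal sketch of the situation described (the loop body is my assumption, not the original code), which matters mainly on older runtimes before asynchronous preemption arrived in Go 1.14:

    package main

    import (
        "fmt"
        "runtime"
    )

    // transform stands in for a long, purely computational loop. Before Go 1.14,
    // a loop like this with no function calls contained no preemption points,
    // so it could hog its thread until it finished.
    func transform(data []int) {
        for i := range data {
            data[i] = data[i]*31 + 7 // hypothetical per-element work
            if i&1023 == 0 {
                runtime.Gosched() // explicit yield; cheap, but it adds up in hot loops
            }
        }
    }

    func main() {
        data := make([]int, 1<<20)
        transform(data)
        fmt.Println(data[0])
    }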
There's a Go course that was really good at this level of nuance. The author talks a lot about mechanical sympathy and how to dig into details like this. I think it's called Ultimate Go?
I tried to come up with the most efficient implementation of this rather simple function that I could think of in pure Go, without going down to SIMD assembly: https://go.dev/play/p/zHFxwvWOoeT

-32.31% geomean across the different tests looks rather great.
Any ideas how to make it even faster?
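The playground code isn't reproduced here, but one common direction for a function like this (a hypothetical sketch of mine, not necessarily what the link does) is to break the serial dependency on a single maximum by keeping several independent accumulators and merging them at the end:

    // Hypothetical sketch: four independent running maxima give the CPU four
    // dependency chains to work on in parallel; they're merged after the loop.
    // Assumes a non-empty slice.
    func max4(xs []int64) int64 {
        m0, m1, m2, m3 := xs[0], xs[0], xs[0], xs[0]
        i := 0
        for ; i+4 <= len(xs); i += 4 {
            if xs[i] > m0 {
                m0 = xs[i]
            }
            if xs[i+1] > m1 {
                m1 = xs[i+1]
            }
            if xs[i+2] > m2 {
                m2 = xs[i+2]
            }
            if xs[i+3] > m3 {
                m3 = xs[i+3]
            }
        }
        for ; i < len(xs); i++ { // leftover tail elements
            if xs[i] > m0 {
                m0 = xs[i]
            }
        }
        if m1 > m0 {
            m0 = m1
        }
        if m2 > m0 {
            m0 = m2
        }
        if m3 > m0 {
            m0 = m3
        }
        return m0
    }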
Most languages have a `max` function, so the core of the loop could be written with just something like `maxV = max(maxV, v)`.

That could be entirely branchless, right?
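For what it's worth, Go has had a built-in max for ordered types since Go 1.21, so the loop can be written exactly that way; whether the compiler then emits a conditional move or a branch is still its own decision, so branchless source doesn't guarantee branchless machine code. A minimal sketch assuming Go 1.21+:

    // Sketch assuming Go 1.21+, where max is a builtin (non-empty slice assumed).
    func maxSlice(xs []int64) int64 {
        maxV := xs[0]
        for _, v := range xs[1:] {
            maxV = max(maxV, v) // no if statement in the source
        }
        return maxV
    }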
Kind of tangential, but who are these people who are so comfortable disassembling a high-level language binary, reading assembly, and then making statements about branch prediction and other such low-level esoterica? I've only ever met people like that maybe two or three times in my career, and yet it seems like in certain language circles every other blog post I read is by some kind of ASM and reverse-engineering expert.