In the post, the 2 multiplications plus 2 additions are not intrinsically faster than the 2 additions.<p>The post compares (1) loop code that can be vectorized, because the iterations are independent and do not depend on the result of the previous iteration, and (2) an "optimization" that makes the calculation shorter but also makes each iteration depend on the result of the previous one, so it cannot be vectorized.
Tldr: the autovectorizer turns the first version into SIMD instructions, but the second into scalar 64-bit instructions.<p>SIMD is very powerful, and modern compilers can sometimes simd-ify your code automatically.<p>-------<p>The second version could probably become SIMD as well, but it's beyond GCC's ability to autovectorize it in that form. I kinda want to give it a go myself but don't have time today...<p>Autovectorizers are mysterious, often harder to reason about and rely on than explicitly SIMD code (like OpenCL or CUDA).
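To make the contrast concrete, here is a sketch of the two loop shapes under discussion (assuming the post's polynomial data[i] = A*i*i + B*i + C; the constants, function names, and LEN here are placeholders, not the post's exact code):

```c
#include <assert.h>

#define LEN 1000
static const double A = 1.5, B = 2.5, C = 3.5;

/* Version 1: every iteration is independent, so the compiler is free
 * to compute several elements at once with SIMD. */
static void compute_independent(double *data) {
    for (int i = 0; i < LEN; i++)
        data[i] = A * i * i + B * i + C;   /* no state carried between iterations */
}

/* Version 2: strength-reduced. Fewer operations per element, but Y and Z
 * carry across iterations, forming a serial dependency chain. */
static void compute_dependent(double *data) {
    double Y = C;        /* Y(0) */
    double Z = A + B;    /* first difference Y(1) - Y(0) */
    for (int i = 0; i < LEN; i++) {
        data[i] = Y;
        Y += Z;          /* needs the previous iteration's Y and Z */
        Z += 2 * A;      /* the polynomial's second difference is constant */
    }
}
```

With these particular constants the two versions happen to agree exactly, but in general the accumulated version can drift, which is one reason a compiler won't transform between the two forms on its own without something like -ffast-math.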
I think it's worth pointing out that the reason these two examples execute at different speeds is both how the compiler translated the code AND how much of the work the CPU was able to parallelize. Compilers take knowledge about the target platform (e.g. the instruction set) plus the source code and translate it into executable code. A compiler CAN (but doesn't have to) rewrite code, but only if the rewrite ALWAYS produces the same result as the input code.<p>I feel like for the last 10-15 years (the majority of) people have stopped thinking about the specific CPU and only think about the ISA. That works for a lot of workloads, but in recent years I have observed more and more interest in how a specific CPU can execute code as efficiently as possible.<p>If you're interested in the kind of optimizations performed in the example you should check out polyhedral compilation (<a href="https://polyhedral.info/" rel="nofollow">https://polyhedral.info/</a>) and Halide (<a href="https://halide-lang.org/" rel="nofollow">https://halide-lang.org/</a>). Both can be used to speed up certain workloads significantly.
Seems like you can have the cake and eat it too, by manually parallelizing the code to something like this:<p><pre><code> double A8 = 8*A;       // second difference per 2-step stride
 double Y1 = C;         // Y(0)
 double Y2 = A+B+C;     // Y(1)
 double Z1 = 4*A+2*B;   // Y(2)-Y(0)
 double Z2 = 8*A+2*B;   // Y(3)-Y(1)
 int i;
 for (i = 0; i+1 < LEN; i += 2) {
     data[i]   = Y1;
     data[i+1] = Y2;
     Y1 += Z1;  Z1 += A8;
     Y2 += Z2;  Z2 += A8;
 }
 if (i < LEN)           // tail when LEN is odd
     data[i] = Y1;
</code></pre>
Probably not entirely functional as written, but you get the idea: unroll the loop so that the data-dependent paths can each be computed in parallel. For the machine being considered, a 4-step unroll should achieve maximum performance, but of course, you get all the fun that comes with hard-coding the architecture into your software.
The part about data dependencies across loop iterations is fascinating to me, because it's mostly invisible even when you look at the generated assembly. There's a related optimization that comes up in implementations of ChaCha/BLAKE, where we permute columns around in a kind of weird order, because it breaks a data dependency for an operation that's about to happen: <a href="https://github.com/sneves/blake2-avx2/pull/4#issuecomment-502507027" rel="nofollow">https://github.com/sneves/blake2-avx2/pull/4#issuecomment-50...</a>
I doubt repeatedly adding floating point numbers is a good idea. With every addition the sum moves further away from the addend, and as their relative difference grows, so do the rounding problems.<p>Just because the algebra works, it doesn't guarantee that the code does, too.
It’s almost like managing the state yourself on a modern CPU is a complicated task. Automatic vectorisation, reordering, etc. can all be applied more easily to a pure declarative program. Imperatively managing state and control flow with an ancient language like C no longer reflects the underlying hardware, which is significantly more advanced.
Besides the autovectorization, the second version also has two additional assignments. Depending on where the storage for these ends up (registers, L1/L2, or the stack), there might be a performance hit.
It's fairly obvious: the rewrite prevents parallelization because floating-point isn't associative.<p>You'd need to parallelize it explicitly (which can be done by just unrolling the loop).
First: we need to finally stop the harmful myth that floating point multiplication is slower than addition. This has not been true for a long while.<p>Second: why are so many people insisting that the loop is auto-vectorised? Is there any evidence for that? Data dependencies alone explain the observed performance delta; auto-vectorization would have resulted in a higher speedup.
Vectorisation is not free. There is one other dimension to optimise for: power.<p>The suggested "slower" optimisation <i>does</i> fundamentally use fewer instructions. Throwing more hardware at parallelisable problems makes them run faster but does not necessarily reduce the power requirements much, because there are fundamentally the same number of operations; it's just gobbling up the same power over a shorter period of time. The "slower" serial algorithm uses fewer instructions and, in theory, less power in total. Disclaimer: I mean power vs speed fundamentally, in theory; in practice there may be no measurable difference depending on the particular CPU and code.
I'm curious about how an optimiser determines that a block of code can be vectorised. It's trivial to see this in the initial version of compute(), but I'm not sure how an optimiser does this. Is it as simple as checking that A, B, and C are const?<p>And how far would an optimiser typically take its analysis? For example, what if B was defined inside the loop as (non-const) A * 2? Or as A * f(), where f() always returns 2, maybe using a constant or maybe a calculation, etc.?<p>Seems like a very hard problem.
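One big piece of that analysis is proving the iterations independent, and pointer aliasing is a classic blocker. A hypothetical illustration (function names are made up; with GCC you can ask what the vectorizer decided and why via -fopt-info-vec-missed):

```c
/* `restrict` promises that out and in never overlap, so every iteration
 * is provably independent and the vectorizer may process several at once.
 * Without restrict, the compiler must either prove non-overlap or emit a
 * runtime overlap check before taking the vector path. */
static void scale(double *restrict out, const double *restrict in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0;
}

/* Here each iteration reads the value the previous one just wrote:
 * a loop-carried dependence, not vectorizable in this form. */
static void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

Const-ness of A, B, and C is part of it (the compiler must know they don't change mid-loop), but the core question is whether any iteration reads something another iteration writes.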
Oh man, I have been smoked by data dependencies like this so many times. I've gotten a lot better over the years at "seeing" data dependencies in godbolt or whatever, but it still slips through in my code and that of my colleagues way more often than I'd like.<p>Is anyone aware of good tooling for automatically catching this sort of thing, even some of the time?
I'm running into an interesting result -- on my machine, the fast version runs at the same speed at LEN=1000000 (the default in the sample), but starts running 1.5 times faster at LEN=500000 and ends up twice as fast at LEN=100000 and lower. This is with gcc -O3. Why could this be?
For fun, I tried this in Go on an M1 Mac (basically because benchmarking in Go is so easy).<p>And... the two versions run at exactly the same speed, on the M1 Mac, with Go.<p>edit: of course, with Go, you can manually parallelize the faster option with goroutines... but that does something else, doesn't it. (And it's 500x faster.)
Funny that this question gained 180 upvotes in 10 days, when it could just as well have received the reverse for showing quite little of what the author had tried in order to figure out the (rather obvious) data dependency on his own.
Given that in terms of “absolute work” done, the optimization does hold true, is there any situation where it would be beneficial to implement this?<p>(Super low energy processors, battery powered, etc.?)
Is the implication here that tail call optimizations don't work anymore? They might seem to do the proper thing on the language level but the CPU just can't think that way.
On older archs there is also a cost for casting int to float. Not as bad as float to int, but it's there.<p>The general advice when I worked with CG was to use deltas in iterations.
If we ignore the details, then this is a perfect example of how simpler code should be preferred _by default_ over complex code. Both for performance and reasoning. In this case and often elsewhere, these are related things.<p>I'm taking the definition of simple and complex (or complected) from this talk[0]. In short: Simple means individual pieces are standing on their own, not necessarily easy or minimal. Complex means individual pieces are braided together into a whole.<p>The first code is simpler than the second. Just have a look at the two solutions and you see what I mean: The individual instructions in the first stand on their own and clearly declare what they mean as well. The second example is more complex, because each line/subexpression cannot be understood in isolation, there is an implicit coupling that requires the programmer _and_ the machine code interpreter to understand the whole thing for it to make sense. The CPU apparently fails at that and cannot optimize the machine instructions into more efficient microcode.<p>The example illustrates performance benefits of simpler code, but there are others too:<p>For example a type checker might be able to infer things from simpler code, but not from complex code. Again, complexity is about coupling, which can for example arise from making assumptions about things that are out of bounds or logic that happens in some other part of your program, which _this_ part relies on.<p>Things like these can either be outright rejected by a type checker or simply ignored and sometimes we can provide additional ceremony to make it happy. But there is always an underlying question when these issues arise: Can I express this in a simpler way? It's sometimes possible to describe rich semantics with simpler, more primitive types and explicit coordination.<p>Another thing that comes to mind are rich ORMs. They can be very convenient for common cases. 
But they have footguns for the general case, because they _complect_ so many things, such as validation, storage, domain rules, caching etc. And they are leaky abstractions because we have to do certain things in certain ways to avoid bloated SQL, N+1 etc.<p>They are not simple (and certainly not declarative) so we program against a black box. There are simpler "ORMs" that I very much like, but they typically only provide a minimal set of orthogonal features, such as query builders, and mapping results into plain data structures. It is simpler to use these kind of orthogonal tools.<p>Last but not least: Simple code can be pulled apart and changed more easily without having to worry about introducing problems. The substitution rule or referential transparency enables this by guaranteeing that each expression can be replaced by the value it will evaluate to. This also implies that other expressions that evaluate to the same value can be substituted freely.<p>[0] Simple Made Easy - Rich Hickey at Strangeloop 2011 <a href="https://www.youtube.com/watch?v=LKtk3HCgTa8" rel="nofollow">https://www.youtube.com/watch?v=LKtk3HCgTa8</a>
TLDR: Due to loop-carried dependencies preventing parallelized execution: <a href="https://en.wikipedia.org/wiki/Loop_dependence_analysis#Loop-carried_dependence_vs._loop_independent_dependence" rel="nofollow">https://en.wikipedia.org/wiki/Loop_dependence_analysis#Loop-...</a><p>In the 2 additions version, computation of the next iteration depends on the results of the preceding iteration. In the multiplication version, the computations are independent for each iteration, enabling parallel execution (by SIMD and/or pipelined/superscalar execution).