
An adventure in trying to optimize math.Atan2 with Go assembly

185 points | by tetraodonpuffer | over 7 years ago

16 comments

dvt · over 7 years ago
Although this post is interesting, the best way to optimize trig functions is generally via rational approximations [1], if you can live with the error (and the clamping). I remember finding a rational function to approximate a tanh-like soft clipper via a Padé approximant on (-1, 1) a few years ago [2]. See an equivalent rational approximation for atan2 here [3].

[1] http://www.embedded.com/design/other/4216719/Performing-efficient-arctangent-approximation

[2] https://stackoverflow.com/questions/6118028/fast-hyperbolic-tangent-approximation-in-javascript/6118100#6118100

[3] https://www.dsprelated.com/showarticle/1052.php
Jyaif · over 7 years ago
It should be noted that if you are computing sin/cos after atan2, then it's likely that what you are doing could be done with some regular vector manipulation (normalisation, possibly dot product, cross product).
drv · over 7 years ago
The assembly version is using the packed version of the FMA instruction (that's what the "P" in the mnemonic stands for), but as far as I can tell, it's only using one of the packed values, whereas the instruction can calculate two (AVX) or four (AVX2) FMA operations at once. It might be possible to get some speedup by rearranging the calculation so it can use the full width of the vector registers - at first glance, at least the two sides of the division should be possible to calculate in parallel with half as many FMA instructions.
CorvusCrypto · over 7 years ago
If anyone was curious why the instruction address didn't match up, it's because he read the instruction identifier wrong. It's also totally counterintuitive and you have to essentially read the full documentation to get to any real information about it. Anyway, here we go. VEX is a prefix notation with specialised encodings:

C4 - for a three-byte instruction (which vfmadd is) this is C4; for a two-byte instruction it would be C5.

E2 - this one is more involved, but basically for this instruction the first three bits are 1, for setting some modes on it (R, X, and B). Then, because this is an 0F38H instruction, the next 5 bits are 00010. Altogether you get 11100010 (E2).

E9 - this is also involved, but the first bit is the W mode. This is 1 for the instruction he used, but it's pretty much ignorable. Bits 2-5 encode the first source register the instruction uses; this is 1101 because it uses XMM2/YMM2 (it's by lookup in a table). Bit 6 is set to 0 if it is a 128-bit vector (it is). Bits 7-8 are set to 01 based on a lookup table for the "pp" value, which is 66 in the instruction identifier. Altogether that's 11101001 (E9).

A8 - this one is easy; it's the opcode.

C3 - this is the ModR/M byte, which is actually used in this case for reporting the register/memory operands. The first 2 bits are 11 to indicate the first operand is the register itself (not displaced or truncated). The next 6 bits are the register codes: 000 is actually not used (since the destination is overridden for float maths), and 011 is EBX as the base CPU register to source the data.

Altogether it works out to be C4 E2 E9 A8 C3. Not intuitive at all, really.

Edit: Please someone correct me if I missed something. I hate this kind of stuff and I'm sure I made a mistake.
flavio81 · over 7 years ago
> "The reason for optimizing the math.Atan2 function is because my current work involves performing some math calculations. And the math.Atan2 call was in the hot path."

He should take a look at:

a) the range of x which will be used for atan2(x)

b) the level of precision he really needs

There is a chance he could get away with precomputing an atan2(x) table in memory and then just reading the values off the table. Quick as lightning, and an old trick, probably invented in the 1950s!!

EDIT: I see I am potentially *wrong*, due to slower DRAM access compared to some built-in FP instructions...
jcranmer · over 7 years ago
Not to be mean to the author, but this is an example of why it's generally useless to try to resort to assembly to beat a compiler unless you're an expert at assembly. The sort of case that's being optimized here--a straight-line piece of assembly--is the case where compilers are going to do better than hand-coded except in specific cases.

The first critical failure is using the source code to guide assembly optimization, rather than the original generated assembly. It means that you're going to miss what optimizations the compiler is doing to improve this, and missing those optimizations is going to turn out worse than the original.

There are two main optimizations that come to mind. The first is instruction scheduling. On newer CPUs (Haswell or later), you have two separate instruction ports for vector ALU instructions, so you can execute two instructions in parallel if there are no conflicts. If you interleave the instructions for computing the denominator and the numerator, you can take advantage of the twin execution ports and execute two instructions at once instead of one (you use more registers, though). This is the main optimization that I'm sure the Go library is getting.

The other possible optimization is SLP vectorization: the numerators and denominators are both polynomial interpolations, so they're doing the same operations on data vectors. This means that you can stick both of them in the vector and use one instruction to compute the operation for numerator and denominator simultaneously. On the other hand, it's not immediately clear if the extra cost in vector formation is going to be worth the win in speed, particularly given that you can already simultaneously schedule the two instructions via the first optimization. I don't know if Go is applying this optimization.

So if you're wondering why it's slower, it's almost certainly that the hand-written assembly is actually poorly scheduled.
Waterluvian · over 7 years ago
I dabbled a little in gaming with Unity. I discovered that atan2 is called a LOT. At least... I used it a lot.

Shouldn't something like this be implemented as a hardware-level operation? Or is it? Or why can't it be? Or maybe it's not used nearly as much as I think?
cpr · over 7 years ago
Wow, this brings back vivid memories of the late 80's, when I was assigned to write atan2 by hand on Multiflow's 28/14/7-wide VLIW machines.

(Everyone in the company who could hand-code (even us OS guys and gals ;-) was assigned a math routine--it was an all-hands-on-deck kind of performance situation in our competition with Alliant and Convex.)

Imagine trying to hand-schedule up to 28 operations at once, including all memory loads with 12 or 14 (forget now) clocks to memory; all bus contention had to be handled by hand, etc, etc. What fun! That's why VLIW compilers were so challenging to write: all resources software-managed.

But, alas, in the end the mini-super market died. We all lost to Sun eventually, as everyone wanted his own machine, even if much less powerful--never bet against having the computing power closer to the user (starting with mainframes): minicomputer / workstation / PC / mobile / etc, right down the line to today.

(What's going to displace mobile, IoT? Doesn't seem likely. Perhaps when you reach the limit of what you can hold in your hand, that's the last form factor? I don't believe wearable computing will ever really catch on, but I'm old-fashioned.)
skykooler · over 7 years ago
I always find it interesting to read stories of experiments that didn't end up working.
gigatexal · over 7 years ago
Dang. Props to the blogger for posting even though success wasn't had. Hours on end doing assembly is a feat unto itself. I did not see the twist at the end; I was thinking for sure there'd be a huge speedup. But knowledge is knowledge.
b4ux1t3 · over 7 years ago
I would love to see more posts like this. Seeing your thought process when approaching a problem, even if you don't end up solving the problem, or don't do it well, is valuable insight into how other people work, and how I can possibly integrate some of your methods into my own workflow.

Thanks for posting this.
MichaelBurge · over 7 years ago
I'm surprised it doesn't eliminate the BenchmarkAtan2 loop entirely:

* math.Atan2 is in the standard library, so the compiler should know it has no side effects and can be replaced with an empty statement. Unless Go has a way to mark a function as not having side effects, no user-defined function would have that property.

* The only observable impact of the loop, if math.Atan2 is removed, is that b.N could be a larger size type than n, so it could loop forever depending on the values. If they're known to be the same type, it could eliminate the loop entirely.

Are you sure you're compiling with optimizations enabled?

I also wonder if that DIVSD instruction in yours might cause a floating-point exception, which would be a different behavior compared to atan2.
kristianp · over 7 years ago
I'm no optimisation expert, but I'm wondering if the FMAs are slow because the result of each one is dependent on the previous one? The dependency on the result may mean that the processor can't pipeline the operations. Could it be faster if the two chains of FMAs on either side of the division are interleaved and use different registers?

z := x * x
z = z * fma(fma(fma(fma(P0, z, P1), z, P2), z, P3), z, P4) / fma(fma(fma(fma((z+Q0), z, Q1), z, Q2), z, Q3), z, Q4)
z = fma(x, z, x)

This article is quite the nerd-snipe.
xioxox · over 7 years ago
In C/C++ it would be much easier to call the intrinsic functions which represent the FMA instructions. The compiler would then be able to inline and optimize the whole thing.
agnivade · over 7 years ago
Author here. Glad to answer any questions you guys have :)
SolarNet · over 7 years ago
I think the primary thing you are missing to get the speed improvement is actually doing the packed computations (i.e. computing two values at once, such as the numerator and denominator).