
Speeding up atan2f

245 points by rostayob, almost 4 years ago

21 comments

gumby, almost 4 years ago
Because of the disregard for the literature common in CS, I loved this part:

> This is achieved through ... and some cool documents from the 50s.

A bit of an anecdote: back when I was a research scientist (corporate lab) 30+ years ago, I would in fact go downstairs to the library and read — I was still a kid with a lot to learn (and still am). When I came across (by chance or by someone's suggestion) something useful to my work, I'd photocopy the article and try it out. I'd put a comment in the code with the reference.

My colleagues in my group would give me (undeserved) credit for my supposed brilliance even though I said in the code where the idea came from and would determinedly point to the very paper on my desk. This attitude seemed bizarre, as the group itself was producing conference papers and even books.

(Obviously this was not a universal phenomenon, as there were other people in the lab overall, and friends on the net, suggesting papers to read. But I've seen it a lot, from back then up to today.)
jacobolus, almost 4 years ago
If someone wants a fast version of x ↦ tan(πx/2), let me recommend the approximation:

    tanpi_2 = function tanpi_2(x) {
      var y = (1 - x*x);
      return x * (((-0.000221184 * y + 0.0024971104) * y - 0.02301937096) * y
                  + 0.3182994604 + 1.2732402998 / y);
    }

(valid for -1 <= x <= 1)

https://observablehq.com/@jrus/fasttan with error: https://www.desmos.com/calculator/hmncdd6fuj

But even better is to avoid trigonometry and angle measures as much as possible. Almost everything can be done better (faster, with fewer numerical problems) with vector methods; if you want a 1-float representation of an angle, use the stereographic projection:

    stereo = (x, y) => y/(x + Math.hypot(x, y));

    stereo_to_xy = (s) => {
      // cos = (1 - s*s)*q, sin = 2*s*q, with q = 1/(1 + s*s)
      var q = 1/(1 + s*s);
      return !q ? [-1, 0] : [(1 - s*s)*q, 2*s*q];
    }
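(An editor's sketch of the round trip this comment describes, ported to C purely as an illustration: the stereographic value s = y/(x + hypot(x, y)) equals tan(θ/2), so 2·atanf(s) recovers atan2f(y, x) everywhere except on the negative x-axis. The function names and test points below are made up for the demo.)

    #include <math.h>
    #include <stdio.h>

    /* Stereographic representation of the direction of (x, y):
       s = tan(theta/2), where theta = atan2(y, x).              */
    static float stereo(float x, float y) {
        return y / (x + hypotf(x, y));   /* breaks down only at x = -r, y = 0 */
    }

    int main(void) {
        float pts[][2] = { {1, 0}, {3, 4}, {-2, 5}, {0.5f, -0.1f} };
        for (int i = 0; i < 4; i++) {
            float x = pts[i][0], y = pts[i][1];
            float s = stereo(x, y);
            /* half-angle identity: atan2(y, x) == 2 * atan(s) */
            printf("atan2f = % .7f   2*atanf(s) = % .7f\n",
                   atan2f(y, x), 2.0f * atanf(s));
        }
        return 0;
    }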
aj7, almost 4 years ago
Around 1988, I added phase shift to my optical thin film design program written in Excel 4.0 for the Mac. At the time, this was utterly unique: each spreadsheet row represented a layer, and the matrices describing each layer could be calculated right in that row by squashing them down horizontally. The S- and P-polarization matrices could be recorded this way, and the running matrix products similarly maintained. Finally, using a simple one-input table, the reflectance of a typically 25-31 layer laser mirror could be calculated. And in less than a second on a 20 MHz 68020 (?) Mac II for about 50 wavelengths. The best part was the graphics, which were instantaneous, beautiful, publishable, and pasteable into customer quotations. Semi-technical people could be trained to use the whole thing.

Now about the phase shift. In 1988, atan2 didn't exist. Anywhere. Not in FORTRAN, Basic, Excel, or a C library. I'm sure phase shift calculators implemented it, each working alone. For us, it was critical. You see, we actually cared not about the phase shift itself, but its second derivative, the group delay dispersion. This was the beginning of the femtosecond laser era, and people needed to check whether these broadband laser pulses would be inadvertently stretched by reflection off or transmission through the mirror coating. So atan2, the QUADRANT-PRESERVING arc tangent, is required for a continuous, differentiable phase function. An Excel function macro did this, with IF statements correcting the quadrant. And the irony of all this?

I CALLED it atan2.
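(For readers who have never had to build one: a minimal C sketch of the quadrant-correcting construction described above, i.e. atan2 assembled from plain atan plus sign tests. The name my_atan2 is hypothetical; this is an illustration, not the commenter's Excel macro.)

    #include <math.h>

    static const double PI = 3.14159265358979323846;

    /* atan2 built from atan by fixing up the quadrant afterwards,
       the same IF-statement idea described in the comment above.   */
    static double my_atan2(double y, double x) {
        if (x > 0.0)              return atan(y / x);        /* quadrants I, IV */
        if (x < 0.0 && y >= 0.0)  return atan(y / x) + PI;   /* quadrant II     */
        if (x < 0.0)              return atan(y / x) - PI;   /* quadrant III    */
        /* x == 0: straight up or down (undefined at the origin) */
        return y > 0.0 ? PI / 2 : (y < 0.0 ? -PI / 2 : 0.0);
    }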
drej, almost 4 years ago
Nice. Reminds me of an optimisation trick from a while ago: I remember being bottlenecked by one of these trigonometric functions years ago when working with a probabilistic data structure... then I figured the input domain was pretty small (a couple dozen values), so I precomputed those and used an array lookup instead. A huge win in terms of perf, obviously only applicable in these extreme cases.
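(A tiny sketch of that trick, with made-up names and a made-up domain of 64 values: when the inputs can only take a small, known set of values, the transcendental call can be replaced by a precomputed table lookup.)

    #include <math.h>

    #define N_INPUTS 64                    /* small, known input domain (illustrative) */
    static double sin_table[N_INPUTS];

    static void init_table(void) {
        for (int i = 0; i < N_INPUTS; i++)
            sin_table[i] = sin((double)i / N_INPUTS);   /* pay for the trig call once */
    }

    /* hot path: an array lookup instead of a libm call */
    static inline double fast_sin_of_index(int i) {
        return sin_table[i];
    }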
Const-me, almost 4 years ago
I wonder how it compares with Microsoft's implementation, here: https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/DirectXMathVector.inl#L4915-L5104

Based on the code, your version is probably much faster. It would still be interesting to compare precision; MS uses a 17-degree polynomial there.
stephencanon, almost 4 years ago
> if we're working with batches of points and willing to live with tiny errors, we can produce an atan2 approximation which is 50 times faster than the standard version provided by libc.

Which libc, though? I assume glibc, but it's frustrating when people talk about libc as though there were a single implementation. Each vendor supplies their own implementation; libc is just a common interface defined by the C standard. There is no "standard version" provided by libc.

In particular, glibc's math functions are not especially fast--Intel's and Apple's math libraries are 4-5x faster for some functions[1], and often more accurate as well, for example (and both vendors provide vectorized implementations). Even within glibc versions, there have been enormous improvements over the last decade or so, and for some functions there are big performance differences depending on whether or not -fno-math-errno is specified. (I would also note that atan2 has a lot of edge cases, and more than half the work in a standards-compliant libc is in getting those edge cases with zeros and infinities right, which this implementation punts on. There's nothing wrong with that, but that's a bigger tradeoff for most users than the small loss of accuracy, and important to note.)

So what are we actually comparing against here? Comparing against a clown-shoes baseline makes for eye-popping numbers, but it's not very meaningful.

None of this should really take away from the work presented, by the way--the techniques described here are very useful for people interested in this stuff.

[1] I don't know the current state of atan2f in glibc specifically; it's possible that it's been improved since last I looked at its performance. But the blog post cites "105.98 cycles / element", which would be glacially slow on any semi-recent hardware, which makes me think something is up here.
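(To make the edge-case point concrete, a small check harness one might write; the special values below are ones C99 Annex F pins down for atan2, e.g. atan2(±0, -0) = ±π and atan2(±∞, -∞) = ±3π/4. Any fast approximation can be swept over the same table to see which of these it gives up. This harness is illustrative, not from the article.)

    #include <math.h>
    #include <stdio.h>

    /* A few of the atan2 corner cases a standards-compliant libm must get right.
       Swap in an approximate atan2f here to see which of them it drops.          */
    int main(void) {
        const float cases[][2] = {        /* { y, x } pairs */
            {  0.0f, -0.0f },             /* +0, -0        -> pi     */
            { -0.0f, -0.0f },             /* -0, -0        -> -pi    */
            {  1.0f,  0.0f },             /* y > 0, +0     -> pi/2   */
            {  INFINITY, -INFINITY },     /* +inf, -inf    -> 3*pi/4 */
            {  INFINITY,  INFINITY },     /* +inf, +inf    -> pi/4   */
            {  1.0f, -INFINITY },         /* finite, -inf  -> pi     */
        };
        for (unsigned i = 0; i < sizeof cases / sizeof cases[0]; i++)
            printf("atan2f(% g, % g) = % g\n",
                   cases[i][0], cases[i][1], atan2f(cases[i][0], cases[i][1]));
        return 0;
    }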
hdersch, almost 4 years ago
A comparison with one of the many SIMD math libraries would have been fairer than with plain libm. A long time ago I wrote such a dual-platform library for the PS3 (Cell processor) and x86 architecture (outdated, but still available here [1]). Depending on how the standard libm implements atan2f, a speedup of 3x to 15x is achieved, without sacrificing accuracy.

1. https://webuser.hs-furtwangen.de/~dersch/libsimdmath.pdf
prionassembly, almost 4 years ago
I wonder whether Padé approximants are well known by this kind of researcher. E.g. http://www-labs.iro.umontreal.ca/~mignotte/IFT2425/Documents/EfficientApproximationArctgFunction.pdf
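(For the curious: the [3/2] Padé approximant of arctan about 0 is arctan(x) ≈ x(15 + 4x²)/(15 + 9x²), which agrees with the Taylor series through the x⁵ term. A throwaway C sketch, unrelated to the specific approximants in the linked paper:)

    #include <math.h>
    #include <stdio.h>

    /* [3/2] Pade approximant of arctan around 0: agrees with the
       Taylor series x - x^3/3 + x^5/5 through fifth order.        */
    static float atan_pade(float x) {
        float x2 = x * x;
        return x * (15.0f + 4.0f * x2) / (15.0f + 9.0f * x2);
    }

    int main(void) {
        for (float x = -1.0f; x <= 1.0f; x += 0.25f)
            printf("x=% .2f  atanf=% .6f  pade=% .6f\n", x, atanf(x), atan_pade(x));
        return 0;
    }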
jvz01, almost 4 years ago
I have developed very fast, accurate, and vectorizable atan() and atan2() implementations, leveraging AVX/SSE capabilities. You can find them here [warning: self-signed SSL cert].

https://fox-toolkit.org/wordpress/?p=219
drfuchs, almost 4 years ago
Wouldn’t CORDIC have done the trick faster? There’s no mention that they even considered it, even though it’s been around for half a century or so.
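(For readers who haven't met it: CORDIC in "vectoring mode" computes atan2 using only shifts, adds, and a small table of arctan(2^-i) constants. A minimal, unoptimized float sketch of the idea follows; it makes no claim about how CORDIC would actually perform against the article's SIMD version.)

    #include <math.h>
    #include <stdio.h>

    #define ITERS 24

    /* CORDIC vectoring mode: rotate (x, y) onto the positive x-axis in
       steps of +/- atan(2^-i), accumulating the total rotation. A fixed-
       point version needs only shifts, adds, and the constant table.     */
    static float cordic_atan2(float y, float x) {
        static const float PI = 3.14159265358979323846f;
        static float atan_tab[ITERS];
        static int init = 0;
        if (!init) {                          /* table of atan(2^-i) */
            for (int i = 0; i < ITERS; i++) atan_tab[i] = atanf(ldexpf(1.0f, -i));
            init = 1;
        }
        float angle = 0.0f;
        if (x < 0.0f) {                       /* pre-rotate into the right half-plane */
            angle = (y >= 0.0f) ? PI : -PI;
            x = -x; y = -y;
        }
        for (int i = 0; i < ITERS; i++) {
            float d  = (y >= 0.0f) ? 1.0f : -1.0f;   /* drive y toward zero */
            float xs = ldexpf(x, -i), ys = ldexpf(y, -i);
            float xn = x + d * ys;
            y        = y - d * xs;
            x        = xn;
            angle   += d * atan_tab[i];
        }
        return angle;
    }

    int main(void) {
        printf("%f vs %f\n", cordic_atan2(4.0f, 3.0f),   atan2f(4.0f, 3.0f));
        printf("%f vs %f\n", cordic_atan2(-1.0f, -2.0f), atan2f(-1.0f, -2.0f));
        return 0;
    }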
zokier, almost 4 years ago
I would have liked to see the error analysis section expanded a bit, or maybe some sort of tests to validate the max error. In particular, if the mathematical approximation function arctan* has a max error of 1/10000 degrees, then I'd naively expect the float-based implementation to have worse error. Furthermore, it's not obvious whether additional error could be introduced by the division:

    float atan_input = (swap ? x : y) / (swap ? y : x);
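(One cheap way to probe that last question: compare the single-precision division against the same quotient computed in double. A correctly rounded float divide contributes at most half an ulp, roughly 6e-8 relative error, which the sketch below measures empirically over arbitrary positive inputs. Purely an illustration; it says nothing about the article's polynomial itself.)

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Measure the relative error added by doing the y/x division in float
       instead of double, over random positive inputs.                      */
    int main(void) {
        double max_rel = 0.0;
        srand(12345);
        for (int i = 0; i < 1000000; i++) {
            float y = 0.5f + (float)rand() / RAND_MAX;   /* in [0.5, 1.5] */
            float x = 1.0f + (float)rand() / RAND_MAX;   /* in [1.0, 2.0] */
            float  qf = y / x;                           /* float divide      */
            double qd = (double)y / (double)x;           /* reference divide  */
            double rel = fabs((double)qf - qd) / qd;
            if (rel > max_rel) max_rel = rel;
        }
        printf("max relative error of the float divide: %.3g\n", max_rel);  /* ~6e-8 */
        return 0;
    }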
shoo, almost 4 years ago
Related -- there's a 2011 post from Paul Mineiro with fast approximations for logarithm, exponential, power, inverse root. http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html

Mineiro's faster approximate log2 has < 1.4% relative error for x in [1/100, 10]. Here's the simple non-SSE version:

    static inline float fasterlog2 (float x)
    {
      union { float f; uint32_t i; } vx = { x };
      float y = vx.i;                    /* reinterpret the float's bits as an integer */
      y *= 1.1920928955078125e-7f;       /* 2^-23: rescale the exponent field          */
      return y - 126.94269504f;
    }

This fastapprox library also includes fast approximations of some other functions that show up in statistical / probabilistic calculations -- gamma, digamma, the Lambert W function. It is BSD licensed, originally lived on Google Code, and copies of the library live on in GitHub, e.g. https://github.com/etheory/fastapprox

It's also interesting to read through libm. E.g. compare Sun's ~1993 atan2 & atan:

https://github.com/JuliaMath/openlibm/blob/master/src/e_atan2.c

https://github.com/JuliaMath/openlibm/blob/master/src/s_atan.c
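(A self-contained way to eyeball the snippet above against libm's log2f; the test points are arbitrary.)

    #include <stdint.h>
    #include <stdio.h>
    #include <math.h>

    /* fasterlog2 as quoted in the comment above */
    static inline float fasterlog2(float x) {
        union { float f; uint32_t i; } vx = { x };
        float y = vx.i;
        y *= 1.1920928955078125e-7f;
        return y - 126.94269504f;
    }

    int main(void) {
        const float xs[] = { 0.01f, 0.5f, 1.0f, 2.0f, 8.0f, 10.0f };
        for (unsigned i = 0; i < sizeof xs / sizeof xs[0]; i++)
            printf("x=%5.2f  log2f=% .5f  fasterlog2=% .5f\n",
                   xs[i], log2f(xs[i]), fasterlog2(xs[i]));
        return 0;
    }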
pklausler, almost 4 years ago
(Undoubtedly) stupid question: would it be any faster to project (x, y) to the unit circle (x', y'), then compute acos(x') or asin(y'), and then correct the result based on the signs of x & y? When converting Cartesian coordinates to polar, the value of r = HYPOT(x, y) is needed anyway, so the projection to the unit circle would be a single division by r.
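(What that would look like, in a rough C sketch: since acos already returns a value in [0, π], only the sign of y needs fixing, so atan2(y, x) = copysign(acos(x/r), y) away from the origin. Whether it is actually faster is exactly the open question in the comment; the function name is made up.)

    #include <math.h>
    #include <stdio.h>

    /* atan2 via projection onto the unit circle: r is needed anyway when
       converting to polar coordinates, acos gives the magnitude of the
       angle, and the sign of y picks the half-plane.                      */
    static float atan2_via_acos(float y, float x) {
        float r = hypotf(x, y);            /* assumes (x, y) != (0, 0) */
        return copysignf(acosf(x / r), y);
    }

    int main(void) {
        printf("%f vs %f\n", atan2_via_acos(4.0f, 3.0f),   atan2f(4.0f, 3.0f));
        printf("%f vs %f\n", atan2_via_acos(-1.0f, -2.0f), atan2f(-1.0f, -2.0f));
        return 0;
    }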
sorenjan, almost 4 years ago
How do you handle arrays of values where the array lengths are not a multiple of 8 in this kind of vectorized code? Do you zero-pad them before handing them to the vectorized function, or do you run a second loop element by element on the remaining elements after the main one? What happens if you try to do `_mm256_load_ps(&ys[i])` with < 8 elements remaining?
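(The most common answer, sketched below: run the 8-wide loop over the largest multiple of 8 and finish the tail with scalar code; AVX also offers _mm256_maskload_ps for a masked tail load. Loading a full 8 floats past the end is undefined behaviour if it crosses into an unmapped page, which is why the tail needs one of these treatments. The element-wise multiply here is just an illustrative stand-in, not the article's kernel; compile with -mavx.)

    #include <immintrin.h>
    #include <stddef.h>

    /* out[i] = xs[i] * ys[i], with an 8-wide AVX main loop and a scalar tail. */
    static void mul_arrays(float *out, const float *xs, const float *ys, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {                       /* vectorized body   */
            __m256 x = _mm256_loadu_ps(&xs[i]);
            __m256 y = _mm256_loadu_ps(&ys[i]);
            _mm256_storeu_ps(&out[i], _mm256_mul_ps(x, y));
        }
        for (; i < n; i++)                                 /* leftover elements */
            out[i] = xs[i] * ys[i];
    }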
aconz2, almost 4 years ago
Nice writeup and interesting results. I hadn't seen perf_event_open(2) used directly in code before, which looks cool.

The baseline is at a huge disadvantage here because the call to atan2 in the loop never gets inlined and the loop doesn't seem to get unrolled (which is surprising, actually). Manually unrolling by 8 gives me an 8x speedup. Maybe I'm missing something with the `-static` link, but unless they're using musl I didn't think -lm could get statically linked.
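(For anyone who also hasn't seen it: a bare-bones perf_event_open cycle counter, Linux only. This is a generic sketch of the API with a dummy workload, not the harness from the article.)

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* count core cycles    */
        attr.disabled = 1;
        attr.exclude_kernel = 1;                  /* user-space work only */
        attr.exclude_hv = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile double sink = 0;                 /* the code being measured */
        for (int i = 0; i < 1000000; i++) sink += i * 0.5;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        printf("cycles: %llu\n", (unsigned long long)cycles);
        close(fd);
        return 0;
    }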
nice2meetu, almost 4 years ago
I did something similar for tanh once, though I found I could get to 1 ulp.<p>Part of the motivation was that I could get 10x faster than libc. However, I then tried on my FreeBSD and could only get 4x faster. After a lot of head scratching and puzzling it turned out there was a bug in the version of libc on my linux box that slowed things down. It kind of took the wind out of the achievement, but it was still a great learning experience.
azhenley, almost 4 years ago
This is pretty similar to my quest to make my own cos() when my friend didn't have access to libc. It was fun! Though I don't have the math or low-level knowledge that this author does.

https://web.eecs.utk.edu/~azh/blog/cosine.html
spaetzleesser, almost 4 years ago
I am envious of people who can deal with such problems. The problem is defined clearly and can be measured easily.

This is so much more fun than figuring out why some SaaS service is misbehaving.
cogman10, almost 4 years ago
I'm actually a bit surprised that the x86 SIMD instructions don't support trig functions.
h0mie, almost 4 years ago
Love posts like this!
unemphysbro, almost 4 years ago
Coolest blog post I've seen here in a while. :)