Implementing Cosine in C from Scratch (2020)

220 points by alraj, almost 2 years ago

36 comments

jpfr, almost 2 years ago

The Taylor expansion locally fits a polynomial based on the first n derivatives. If you want to find the "best" nth-degree polynomial to approximate the sine function, functional analysis gives the tools for solving that optimization in closed form.

By selecting an appropriate norm (in function space) you can minimize either the maximum error or the error integral over some range (e.g. the 0 to π range).

Here's a video on the subject. You might want to watch earlier ones also for more context.

https://www.youtube.com/watch?v=tMlKZZf2Kac&list=PLdkTDauaUnQpzuOCZyUUZc0lxf4-PXNR5&index=28

Full disclosure, this is my university lecture on optimization that was recorded during Covid.
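
For readers who want the two choices of norm spelled out, here is one way to write them for the sine example on [0, π]. The notation is assumed here and not taken from the lecture: P_n is the set of polynomials of degree at most n, and φ_k is any polynomial basis that is orthogonal on [0, π], such as shifted Legendre polynomials.

```latex
% Minimax (maximum-error) fit:
p_\infty = \arg\min_{p \in \mathcal{P}_n} \; \max_{x \in [0,\pi]} \left| \sin x - p(x) \right|

% Least-squares (integrated-error) fit:
p_2 = \arg\min_{p \in \mathcal{P}_n} \int_0^\pi \bigl( \sin x - p(x) \bigr)^2 \, dx

% The least-squares problem is solved in closed form by orthogonal projection:
p_2(x) = \sum_{k=0}^{n} \frac{\langle \sin, \varphi_k \rangle}{\langle \varphi_k, \varphi_k \rangle} \, \varphi_k(x),
\qquad \langle f, g \rangle = \int_0^\pi f(x)\, g(x) \, dx
```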

azhenley, almost 2 years ago

No need for archive.org, my website moved a few years ago: https://austinhenley.com/blog/cosine.html

amiga386, almost 2 years ago

One of the things I loved most about reading kernel and libc (or rather, libm) sources was the floating-point and FP emulation code. I thought it was immensely cool that what others just expected an FPU instruction to do, someone had not only written out in C, but had also explained in comments: the mathematics (rendered in ASCII), often with numerical analysis of error propagation, accuracy, etc.

Some examples:

https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libm/math/e_pow.c

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/math-emu/poly_sin.c

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/math-emu/README

lntue, almost 2 years ago

So in the implementation of cos_table_*_LERP, you technically did a 2-step range reduction:

1. Reduce x = x mod 2*pi
2. Reduce index = floor(x / 10^-n), and i - index = 10^n * (x mod 10^-n)

With limited input range and required precision as in the tests, you can combine these 2 range reduction steps:

1. Choose the reduced range as a power of 2 instead of a power of 10 for a cheaper modulus operation, let's say 2^-N = 2^-7.
2. Avoid the division in modd(x, CONST_2PI) by multiplying by 2^N / pi.
3. Avoid the round trip double -> int -> double by using the floor function / instruction.

Here is the updated version of cos_table_*_LERP, which should have higher throughput and lower latency:

```
double cos_table_128_LERP(double x) {
  x = fabs(x);
  double prod = x * TWO_TO_SEVEN_OVER_PI;
  double id = floor(prod);
  double w = prod - id;      /* after this step, 0 <= w < 1 (x is already the parameter name) */
  int i = ((int)id) & 0xFF;  /* i = id mod 2^8 */
  /* COS_TABLE must hold at least 257 entries so that i + 1 stays in range */
  return lerp(w, COS_TABLE[i], COS_TABLE[i + 1]);
}
```

You can also optimize lerp a bit more with the formula:

```
lerp(w, v1, v2) = (1 - w) * v1 + w * v2 = w * (v2 - v1) + v1
```

We do employ this range reduction strategy in a more accurate way for trig functions in LLVM libc:

- with FMA instructions: https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/range_reduction_fma.h
- without FMA instructions: https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/range_reduction.h

sampo, almost 2 years ago

> cosine repeats every 2pi

> We can do better since the cosine values are equivalent every multiple of pi, except that the sign flips.

There is one more step to take: each π/2-long segment has an identical shape; the segments just point up or down (as you already noticed), and left or right. So you can reduce your basic domain not just to 0...π, but to 0...π/2.

femto, almost 2 years ago

If doing signal processing, an optimisation is to use an N-bit integer to represent the range 0 to 2*pi. Some example points in the mapping:

0 -> 0
2^(N-2) -> pi/2
2^(N-1) -> pi
2^N (wraps to 0) -> 2*pi (wraps to 0)

If your lookup table has an M-bit index, the index into the lookup table is calculated with: index = (unsigned)theta >> (N-M), where theta is the N-bit integer representing the angle.

The fractional part, which can be used for interpolation, is: theta & ((1<<(N-M))-1).

If you choose M=N=word size (16 bits is often nice), the angle can be used directly as an index. With 16 bits, accuracy is typically good enough without interpolation, and the entire trig operation reduces to an array indexing operation.

This minimises the overhead of converting an angle to a table index.

xorvoid, almost 2 years ago

How good do you need it? Lol.

This is the approximation that I used for the animated sine wave example for SectorC:

y ~= 100 + (x*(157 - x)) >> 7

https://github.com/xorvoid/sectorc/blob/main/examples/sinwave.c
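
As a rough illustration of why this works: x*(157 - x) is a parabola that is zero at x = 0 and x = 157 (about 50 * pi) and peaks in between, which is the shape of one half-wave of a sine. The sketch below is an assumption about the intended grouping, not the SectorC source. In standard C, + binds tighter than >>, so the shift is parenthesized explicitly here to keep the offset of 100 outside it.

```c
/* Integer-only half-wave: x runs from 0 to 157, the result is a row near 100.
 * The explicit parentheses are an assumption about the intended grouping. */
int sinwave_y(int x) {
    return 100 + ((x * (157 - x)) >> 7);
}
```

At the midpoint, x = 78 gives 78 * 79 = 6162, and 6162 >> 7 is 48, so the curve rises 48 rows above the baseline.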

JKCalhoun, almost 2 years ago

The author's Taylor series looked salvageable to me. The range between -Pi/2 and Pi/2 looked fine. That usable range can be reused to serve all other portions of the circle.

Add a conditional that negates the result (-result) for some angles and you're golden.

amadio, almost 2 years ago

Taylor expansion is usually not so good for this; you'll fare better with either Legendre or Chebyshev polynomials. Robin Green has some excellent material on the subject, which I am linking below.

Faster Math Functions:
https://basesandframes.files.wordpress.com/2016/05/fast-math-functions_p1.pdf
https://basesandframes.files.wordpress.com/2016/05/fast-math-functions_p2.pdf

Even faster math functions, GDC 2020:
https://gdcvault.com/play/1026734/Math-in-Game-Development-Summit

eschneider, almost 2 years ago

Trig functions are where accuracy and performance go to die. Accumulated error is a thing, so when optimizing always consider exactly how you're going to be using those functions so you make the 'right' tradeoffs for your application. One size definitely doesn't fit all here, so test and experiment.

midjji, almost 2 years ago

A good idea is not to compute the values of cos over 0 to 2pi directly, but to reduce the range further, using cos(a) = cos(-a), cos(2a) = 2cos^2(a) - 1, or cos(a+pi/4) = ...

So we really only ever need to be able to compute cos in the range 0 to pi/4.

Then for further accuracy we can do the Taylor expansion around pi/8 (or use other approximations).

Finally, the number of terms needed for a fixed accuracy varies with the distance from pi/8.

pacaro, almost 2 years ago

I went through the same exercise implementing trig functions for Scheme in WebAssembly... It was a rabbit hole for sure.

https://github.com/PollRobots/scheme/blob/main/scheme.wasm/src/library/trig.wat#L369

[Edit: above is just the range from 0-π/4, the branch cuts for cos are at https://github.com/PollRobots/scheme/blob/main/scheme.wasm/src/library/trig.wat#L266 ]

mistercow, almost 2 years ago

This is a fine introduction to how you even approach the problem of computing transcendental functions.

It would have been nice to have some discussion of accuracy requirements rather than just eyeballing it and saying “more accuracy than I need”. This is a spot where inexperienced devs can easily get tangled up. An attitude like “ten digits? That’s so many! I’m only making a game, after all” is exactly the sort of thing that gets you into trouble if you start accumulating errors over time, and this is particularly easy to do with trig functions.

dm319, almost 2 years ago

The answer to cos(1.57079632) (in radians) will give you a clue as to how your calculator does it.

See here [0]

[0] https://www.reddit.com/r/calculators/comments/126st95/cos157079632_is_a_bit_of_a_torture_test/
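
A short worked calculation shows why this input is a stress test. The argument sits just below π/2, so the result is tiny and is determined almost entirely by how many digits of π/2 the implementation carries during range reduction. The symbol δ is notation introduced here.

```latex
\cos(x) = \sin\!\left(\tfrac{\pi}{2} - x\right), \qquad
\delta = \tfrac{\pi}{2} - x = 1.5707963267948966\ldots - 1.57079632 \approx 6.7948966 \times 10^{-9}

\sin(\delta) = \delta - \tfrac{\delta^3}{6} + \cdots \approx \delta
\quad\Longrightarrow\quad
\cos(1.57079632) \approx 6.7948966 \times 10^{-9}
```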

fargle, almost 2 years ago

It's unusual, outside x86, for these functions to be hardware accelerated. So just about every libm has to do it in software.

Nearly all libm implementations around are based on Sun's fdlibm from the '80s and '90s, including a bunch of the variants cited below, *bsds, etc. They are basically updated slightly, but you can see their heritage in the structure of the code. The original is found on netlib these days: https://www.netlib.org/fdlibm/k_cos.c

14th order polynomial, but only uses ~7 terms. It's supposed to have error < 1 ULP. For fdlibm, it's pretty readable compared to some of the other fun ones. I seem to remember sqrt being a bit gnarly.

sojuz151, almost 2 years ago

I would say the benchmarks with lookup tables are bad. In the benchmark the CPU will keep the table in cache, but in a real program this cache would have to be shared with the rest of the code and data. This would kill the performance of either the cosine or the rest of the app.

mvcalder, almost 2 years ago

If you like this sort of thing there’s a great book: Methods and Programs for Mathematical Functions by Stephen Moshier. It covers the computation of all sorts of special functions. The code is available in the cephes library but the book may be out of print.

dang, almost 2 years ago

Related:

Implementing Cosine in C from Scratch (2020) - https://news.ycombinator.com/item?id=30844872 - March 2022 (134 comments)

Implementing cosine in C from scratch - https://news.ycombinator.com/item?id=23893325 - July 2020 (20 comments)

xipix, almost 2 years ago

Nice. I'd love to see how this changes when you have SIMD and multiple cosines to compute in parallel. Also when you have to compute sine and cosine simultaneously which is often the case, and then you may be more interested in polar error than cartesian error.

londons_explore, almost 2 years ago

Note that you can combine the lookup table with the Taylor series expansion...

And you can even use the same lookup table for each. That means with 2 table lookups in a 32-entry table, a single multiply and add (and a few bit shifts), you can get ~9 bits of precision in your result, which is fine for most uses. It also probably makes a sin operation take ~1 clock cycle on most superscalar architectures, as long as your workload has sufficient instruction parallelism.

Note that a smaller table typically works out faster because 32 entries fit in the cache, whereas repeated random access into a 1024-entry table would probably kick a bunch of other stuff out of the cache that you wanted.

fallingfrog, almost 2 years ago

Why not do a table, but also store a table of the 1st derivative (which, for cosine, you could use the same table again but shifted)?

Then, you could do a 2nd order fit like a spline instead of a straight line between your table values.

Betcha it would be crazy fast.

Aardwolf, almost 2 years ago

"In all my time of coding, there has only been one situation in which I used cosine: games, games, games.

```
function move(obj) {
  obj.x += obj.speed * Math.sin(obj.rotation);
  obj.y += obj.speed * Math.cos(obj.rotation);
}
```
"

Why not store the velocity as a 2D vector instead? Then you still have to use cos/sin to compute this vector, but at least you don't need it every frame. Plus, often you don't need cos/sin to compute this vector either, since the forces that act on the velocity can themselves have x and y components that you can add to it directly.

Technotroll, almost 2 years ago

Looks to me like you could compute the top half, and then just repeat the rest as a kind of mirror function, that repeats with some set translation. Am I wrong here?

rjmunro, almost 2 years ago

Can you get some better accuracy by noting that cos(x) === sin(π/2 - x) and using e.g. the Taylor expansion for sin when π/4 < x < 3π/4?
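
A sketch of the switch being asked about, restricted to inputs in 0...π for simplicity. Both polynomials are plain Taylor truncations and only ever see arguments of magnitude at most π/4, where they converge quickly. The function names and the number of terms are assumptions.

```c
#include <math.h>

#define PI 3.14159265358979323846

/* Taylor polynomial for cos up to x^8, intended for |x| <= pi/4. */
double cos_taylor(double x) {
    double x2 = x * x;
    return 1.0 + x2 * (-1.0 / 2 + x2 * (1.0 / 24
         + x2 * (-1.0 / 720 + x2 * (1.0 / 40320))));
}

/* Taylor polynomial for sin up to x^9, intended for |x| <= pi/4. */
double sin_taylor(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6 + x2 * (1.0 / 120
         + x2 * (-1.0 / 5040 + x2 * (1.0 / 362880)))));
}

/* cos(x) for 0 <= x <= pi, choosing whichever series has the smaller argument. */
double cos_0_to_pi(double x) {
    if (x < PI / 4)
        return cos_taylor(x);
    if (x < 3 * PI / 4)
        return sin_taylor(PI / 2 - x);   /* cos(x) = sin(pi/2 - x) */
    return -cos_taylor(PI - x);          /* cos(x) = -cos(pi - x) */
}
```

With these term counts the truncation error stays below roughly 3e-8 over the whole interval, whereas the same cos polynomial evaluated directly near π/2 would be off by a few times 1e-5.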

Const-me, almost 2 years ago

I once did that as well: https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxMathTrig.cpp

The method is different, and the OP hasn’t mentioned it: high-degree minimax polynomial approximation.

throwawaaarrgh, almost 2 years ago

Strange. This article was easy to read, with simple formatting, not driven by trends or weird internet culture. There weren't any bombastic claims, aggrandizing statements or dramatic opinions. Just interesting information presented clearly without being verbose. Almost like it was written by an adult. How did this end up on HN?

adeon, almost 2 years ago

I like how there are lots of replies showing different ways to do this, improve it, additional nuances and viewpoints, etc. Lots of smart people.

If I ever want better solutions for some programming problem, I'll write a post about it and try to get it on the front page of HN :)

machina_ex_deus, almost 2 years ago

I would've done it using complex exponentiation. And the exponential function is very easy to estimate fast: instead of the Taylor series, use:

e^x = (1 + x/n)^n

Pick n = 2^N, and that's just a bit shift and N repeated multiplications. Probably much faster and more accurate.
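
A sketch of this approach using C99 complex arithmetic: since e^(ix) = cos(x) + i*sin(x), the real part of (1 + ix/2^N)^(2^N) approximates cos(x), and the power is reached with N squarings. The function name cos_limit is hypothetical.

```c
#include <complex.h>
#include <math.h>

/* Approximates cos(x) as Re[(1 + i*x/2^N)^(2^N)], computed with N squarings. */
double cos_limit(double x, int N) {
    double complex z = 1.0 + I * ldexp(x, -N);  /* ldexp(x, -N) is x / 2^N */
    for (int k = 0; k < N; k++)
        z *= z;                                  /* each squaring doubles the exponent */
    return creal(z);
}
```

Whether this beats a polynomial is far from certain. The magnitude of the result is off by a factor of about exp(x^2 / 2^(N+1)), so N = 20 gives an error near 3e-7 at x = 1, and each squaring also doubles any rounding error already present. The imaginary part gives sin(x) from the same loop.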

aidenn0, almost 2 years ago

Marz's taylor series implementation as-is is significantly faster (~60% the runtime) than the glibc implementation at 6 terms on My Machine(tm), rather than about the same as in TFA. I haven't compared to LERP lookup table yet though.

londons_explore, almost 2 years ago

OP's lookup table benchmarking is bad. The "modd(x, CONST_2PI);" at the top will dominate the runtime by far.

Anyone who wants performance measures angles using n-bit fixed-point math, mapping 0 to 2*pi as 0 to (2^n)-1.

charlieyu1, almost 2 years ago

Related: https://news.ycombinator.com/item?id=35381968 (Cosine Implementation in C)

Much better approximation with only 7 terms

082349872349872, almost 2 years ago

Niklaus Wirth also has some cosine code (probably meant more for software floating point, or fp FPGA blocks); I don't know how it compares with these approximations, but his results seem to be within 1e-6 of Python's math.cos ...

https://people.inf.ethz.ch/wirth/ProjectOberon/Sources/Math.Mod.txt

omgmajk, almost 2 years ago

> Maybe don't stare at that for too long...

I sure did stare at that for way too long.

dang, almost 2 years ago

Url changed from https://web.archive.org/web/20210513043002/http://web.eecs.utk.edu/~azh/blog/cosine.html, which points to this.
