So in the implementation of <i>cos_table_*_LERP</i>, you technically did a 2-step range reduction:<p>1. Reduce <i>x = x mod 2*pi</i><p>2. Reduce <i>index = floor(x / 10^-n)</i>, and <i>i - index = 10^n * (x mod 10^-n)</i><p>With the limited input range and required precision used in the tests, you can combine these 2 range reduction steps:<p>1. Choose the reduced range as a power of 2 instead of a power of 10 for a cheaper modulus operation, let's say <i>2^-N = 2^-7</i>.<p>2. Avoid the division in <i>modd(x, CONST_2PI)</i> by multiplying by <i>2^N / pi</i>.<p>3. Avoid the round trip <i>double -> int -> double</i> by using the <i>floor</i> function / instruction.<p>Here is the updated version of <i>cos_table_*_LERP</i>, which should have higher throughput and lower latency:<p><pre><code> double cos_table_128_LERP(double x) {
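    x = fabs(x); /* cos is even, so only x >= 0 needs handling */
    double prod = x * TWO_TO_SEVEN_OVER_PI; /* x measured in steps of pi / 2^7 */
    double id = floor(prod);
    x = prod - id; /* reuse x for the fraction: after this step, 0 <= x < 1 */
    int i = ((int)id) & 0xFF; /* i = id mod 2^8, since 2*pi spans 2^8 steps */
    /* COS_TABLE needs 2^8 + 1 entries so that i + 1 is valid when i = 255 */
    return lerp(x, COS_TABLE[i], COS_TABLE[i + 1]);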
}
</code></pre>
You can also optimize <i>lerp</i> a bit more with the formula:<p><pre><code> lerp(w, v1, v2) = (1 - w) * v1 + w * v2 = w * (v2 - v1) + v1
</code></pre>
We employ a more accurate version of this range reduction strategy for the trig functions in LLVM libc:<p><pre><code> - with FMA instructions: https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/range_reduction_fma.h
- without FMA instructions: https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/range_reduction.h</code></pre>