Awesome writeup.

I've occasionally thought that there's a lot of time to be saved in machine learning systems by making the floating point operations a bit noisier. Use maybe eight-bit (or even four-bit) numbers for the matrix entries, with the philosophy that the overall degrees of freedom in the system are enough to make up for this local rigidity, while you gain a lot in terms of both storage space and evaluation time.

Along these lines, I saw a great demo of a robot a while ago which used 256 binary devices for movement. Each of these devices is the diagonal of a quadrilateral; when the device is in its 'on' state, the diagonal is long, and when it's in the 'off' state the diagonal is short. Chain a lot of these together, and you can get pretty good accuracy aiming for specific points in a two-dimensional space. The demo I saw was a three-dimensional version of the same concept, of course.

So likewise, we chain together lots of linear operations to build (say) a deep learning system. The weight space is very high-dimensional, and the decision space is the much lower dimensional classification/regression problem: one probably needs a lot less than 64 bits of accuracy in the weight space to approximate a point in the decision space.
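For concreteness, here's a minimal sketch in C of the storage half of that idea, assuming plain symmetric per-vector scaling; quantize_i8 and the sample weights are made up for illustration, not taken from any particular system:

    /* Sketch: symmetric 8-bit quantization of a weight vector.  Store one
       float scale per vector plus int8 entries, trading local precision
       for storage space and memory bandwidth. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static void quantize_i8(const float *w, int8_t *q, float *scale, int n)
    {
        float max_abs = 0.0f;
        for (int i = 0; i < n; i++)
            if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
        *scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        for (int i = 0; i < n; i++)
            q[i] = (int8_t)lrintf(w[i] / *scale);  /* round to nearest int8 */
    }

    int main(void)
    {
        const float w[4] = { 0.81f, -0.03f, 0.25f, -1.10f };
        int8_t q[4];
        float scale;
        quantize_i8(w, q, &scale, 4);
        for (int i = 0; i < 4; i++)
            printf("%+.4f -> %4d (dequantized: %+.4f)\n", w[i], q[i], q[i] * scale);
        return 0;
    }

The arithmetic would then run on the int8 entries and fold the scale back in at the end; with round-to-nearest, each weight is off by at most half a quantization step, which is the "local rigidity" the high-dimensional weight space has to absorb.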
From a practical point of view, if you have Mathematica (which is proprietary software, unfortunately), you can do this easily for any function in a general way:
<a href="http://reference.wolfram.com/language/FunctionApproximations/tutorial/FunctionApproximations.html" rel="nofollow">http://reference.wolfram.com/language/FunctionApproximations...</a><p>(MiniMaxApproximation minimizes the relative maximum error)<p>For the case discussed in write up, it gives<p>MiniMaxApproximation[Log2[x + 1], {x, {1, 2}, 2, 0}]<p>log2(x+1) ~ 0.177392 + 0.943626 x - 0.120232 x^2<p>with a maximum error of -0.000786282.<p>In the 4th order (that is, (4,0)), it gives<p>0.0399284 + 1.27027 x - 0.391233 x^2 + 0.0909038 x^3 -
0.00986308 x^4<p>max error: -4.83031 10^-6.<p>However, if the division isn't too expensive, a (2,2) approximant also works for an even better precision:<p>(0.00383323 + 1.42412 x + 0.336161 x^2)/(1 + 0.704317 x +
0.0598015 x^2)<p>max error: -1.33902 10^-7<p>As for [0.75, 1.5) interval, I get<p>0.108715 + 1.05621 x - 0.165315 x^2<p>which gives a max error -0.000651425 and the max relative error is around 0.0006, which is again way better than what he gives in the write up, although the error is non zero at points 0.75 and 1.5 (which is not really a necessary constraint, but an artifact of the particular polynomial he chose). The overall plot for relative error in a wider range is smooth.<p>Bottom line:
1) Unless you are an expert in numerical methods, consult software, an article, or a book by an expert in the field (or the expert themselves, if you can) at some point.
2) While I appreciate his efforts and the write-up, I wouldn't recommend using his polynomials in practice.

(BTW, one can use frexp instead of bit-twiddling to make the C code friendlier, since not everyone is familiar with the bit layout of IEEE 754 numbers.)
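For concreteness, a minimal sketch of the frexp route in C; the polynomial step is left as a placeholder (it just calls the library log2 here so the sketch runs end to end), since which minimax fit you drop in is up to you:

    /* Sketch: frexp-based range reduction for a fast log2, as an
       alternative to reading the IEEE 754 bit layout directly. */
    #include <math.h>
    #include <stdio.h>

    /* Placeholder for a minimax polynomial approximating log2(m) on [1, 2).
       Replace the body with your fitted polynomial in Horner form. */
    static double log2_mantissa(double m)
    {
        return log2(m);
    }

    static double fast_log2(double x)
    {
        int e;
        double m = frexp(x, &e);  /* x = m * 2^e, with m in [0.5, 1) */
        m *= 2.0;                 /* rescale the mantissa to [1, 2) ... */
        e -= 1;                   /* ... and adjust the exponent to match */
        return (double)e + log2_mantissa(m);
    }

    int main(void)
    {
        const double tests[] = { 0.3, 1.0, 2.5, 1000.0 };
        for (int i = 0; i < 4; i++)
            printf("fast_log2(%g) = %.6f   log2(%g) = %.6f\n",
                   tests[i], fast_log2(tests[i]), tests[i], log2(tests[i]));
        return 0;
    }

frexp may be a touch slower than the raw bit tricks, but it sidesteps the type-punning and layout questions entirely.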
The author of this post, David Goldberg, is the same guy that wrote "What Every Computer Scientist Should Know About Floating-Point Arithmetic."

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
I've learned to be extra careful with fast log approximations. I was the tools programmer at a game studio, and a few years ago I was given the task of speeding things up a bit. There was a process that was summing neighbouring pixels, doing some filtering/pre-processing, then calculating a value through log. I profiled it, found that this was the slowdown, and decided to provide a faster log version. After some googling and testing, it seemed everything should work.

A few days later, black mipmaps started to appear. Turns out I was testing the wrong way: I thought I was comparing new to old, but I was actually comparing old to old output created by another machine (slightly different results were to be expected). All because I forgot to turn off caching.

And my approximation of log2 did not really work for most of the numbers involved. It was using floats, not doubles, to begin with, and was producing a lot of NaNs and infinities (hence the black colors).

Anyway, it was reverted, and we had to reconvert everything to get back to normal.
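Not the actual code from that project, but the kind of edge-case check that would have caught it looks roughly like this: compare the approximation against the library log2f on zeros, negatives, denormals, infinities, and NaN directly, instead of against cached output. Here crude_log2f is just the classic "exponent plus linearized mantissa" bit trick, standing in for whatever approximation is under test:

    /* Sketch: sanity-check a fast log2 against the library on the awkward
       inputs that an approximation tends to get silently wrong. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Crude log2: reinterpret the float's bits, so the result is roughly
       exponent + mantissa.  Fine-ish for positive normals, garbage for
       everything else -- which is exactly what the test should expose. */
    static float crude_log2f(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        return bits * (1.0f / (1 << 23)) - 127.0f;
    }

    int main(void)
    {
        const float tests[] = { 1.0f, 100.0f, 0.0f, -1.0f, 1e-40f, INFINITY, NAN };
        for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
            printf("x = %-12g  approx = %-12g  log2f = %g\n",
                   tests[i], crude_log2f(tests[i]), log2f(tests[i]));
        return 0;
    }

The point isn't this particular approximation; it's that comparing against log2f on the nasty inputs directly is what surfaces the NaNs, infinities, and silently wrong finite values before they turn into black mipmaps.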
Another interesting idea is ditching the format of a classic polynomial. Multiplication isn't the fastest thing ever.

I wonder what would happen if you ran the same minimax scoring scheme through a general instruction synthesizer for the same complexity cap?
This is a pretty nice introduction, and makes me want to return to my side project where I used to play with these approximations, but for SIMD math: https://github.com/boulos/syrah/blob/master/src/include/syrah/FixedVectorMath.h

I hope in a later post he mentions Sollya (http://sollya.gforge.inria.fr); I found it invaluable for playing around with different precision vs. speed trade-offs.
Why is eBay using the log function, and why is it used so heavily that it causes a performance bottleneck for them? Is this an inherited code base from their Hunch purchase?
Boost used to have a minimax approximation algorithm in its codebase. Unfortunately, nobody knows enough about how it works to maintain it, and you have to look in much older releases of Boost to get it.
But why is this important? Can't computers already do this? Even my 1978 TI pocket calculator can do logs very fast. It's not like this is an important unsolved mathematics puzzle.
I tried evolving a good approximation with Eureqa: https://i.imgur.com/1CTTjYD.png?1