This looks pretty cool! What's the catch? E.g., why isn't this already implemented in accelerators? Is it really just a forgotten algorithm, or does it have implications for the cost of building the accelerator, or something else?
Man, this reminds me of something similar I tried working on in 2018, but gave up on after all my PhD applications got rejected.

https://github.com/ixaxaar/pytorch-dni

The concept there goes a bit further and tries to replicate backprop with an external network, arguing that that's probably what the brain actually does.
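For anyone curious, the core of the synthetic-gradient (DNI) idea can be sketched in a few lines of plain PyTorch. This is a rough illustration of the concept, not the pytorch-dni API: a small side network learns to predict the gradient of the loss with respect to a layer's activations, so the layer can update without waiting for the true backward pass.

```python
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)
synthesizer = nn.Linear(64, 64)          # predicts dL/d(activation)
opt_layer = torch.optim.SGD(layer.parameters(), lr=1e-2)
opt_synth = torch.optim.SGD(synthesizer.parameters(), lr=1e-2)

x = torch.randn(32, 128)
h = layer(x)

# 1) Update the layer immediately, using the *predicted* gradient.
synthetic_grad = synthesizer(h.detach()).detach()
opt_layer.zero_grad()
h.backward(synthetic_grad)
opt_layer.step()

# 2) Later, when the true gradient arrives from downstream, train the
#    synthesizer to match it (true_grad is a placeholder for that signal).
true_grad = torch.randn_like(h)
loss = ((synthesizer(h.detach()) - true_grad) ** 2).mean()
opt_synth.zero_grad()
loss.backward()
opt_synth.step()
```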
This is very cool and a really interesting read! For those in the comments confused about how this is better: the paper is about synthesizing matrix multiplication pipelines in hardware, like an FPGA or ASIC. On a CPU or GPU you won't notice, because adds and multiplies generally take the same amount of time, but multiplication units take up many more transistors, so if you can reduce the circuit complexity you can increase speed and parallel throughput while reducing power and routing complexity. This approach could be particularly useful for efficient sparse matrix multiplication accelerators.

Another cool way to eliminate multiplication in matrix multiplication is to use different semirings [1]. The tropical semiring [2], for example, substitutes addition for multiplication and min (or max) for addition. It's still matrix multiplication, just with substituted binary operations. Research in the relatively new field of tropical algebra [3] is quite active and rich right now; it's being used for all kinds of optimization problems and in research on optimizing neural networks [4].
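To make that concrete, here is a minimal sketch of min-plus (tropical) matrix multiplication; numpy is assumed, and the loops are kept naive for clarity:

```python
import numpy as np

def tropical_matmul(A, B):
    """Min-plus ("tropical") matrix product: * becomes +, + becomes min.
    No multipliers required, only adders and comparators."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(m):
            # ordinary matmul: C[i, j] = sum(A[i, :] * B[:, j])
            # tropical matmul: C[i, j] = min(A[i, :] + B[:, j])
            C[i, j] = np.min(A[i, :] + B[:, j])
    return C
```

One well-known payoff: if A is the weighted adjacency matrix of a graph, repeatedly taking the tropical product of A with itself computes all-pairs shortest paths.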
This approach also lends itself to hardware synthesis, since most FPGA configurable logic blocks can add/min/max in one clock cycle, whereas efficient multiplication requires dedicated on-chip hardware multipliers.

Another way to efficiently remove multiplications, with a different but related semiring, is to use a log semiring [5]. If you have to multiply chains of probabilities (as in Markov chains), the numbers quickly become very small and floating point loses the accuracy to represent them. By first mapping the numbers into log space, multiplication becomes addition, and addition becomes x + log1p(exp(y - x)).

[1] https://en.wikipedia.org/wiki/Semiring

[2] https://en.wikipedia.org/wiki/Tropical_semiring

[3] https://en.wikipedia.org/wiki/Tropical_geometry

[4] https://proceedings.mlr.press/v80/zhang18i/zhang18i.pdf

[5] https://en.wikipedia.org/wiki/Log_semiring
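Here's a small sketch of those two log-space operations (again assuming numpy), plus a demonstration of why you'd bother:

```python
import numpy as np

def log_mul(x, y):
    # log(a * b) = log(a) + log(b): multiplication becomes addition
    return x + y

def log_add(x, y):
    # log(a + b) from log(a) and log(b), keeping the larger term
    # outside the exp for numerical stability
    hi, lo = max(x, y), min(x, y)
    return hi + np.log1p(np.exp(lo - hi))

# Chaining tiny probabilities underflows in linear space...
p = 1e-300 * 1e-250 * 1e-200                               # -> 0.0
# ...but stays perfectly representable in log space:
log_p = np.log(1e-300) + np.log(1e-250) + np.log(1e-200)   # ~ -1726.9
print(p, log_p)
```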
I'm surprised this actually works; usually, detecting whether to use multiplication or addition is slower than simply multiplying, especially when massive amounts of work are being done in parallel.
If you're interested in the mathematical theory behind sub-cubic algorithms for matrix multiplication, you can start from here: https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Sub-cubic_algorithms

I conjecture that for every j > 0, any two n x n matrices can be multiplied together in O(n^(2+j)) steps.

(Now proven for 2 + j = ω ≈ 2.3728596, i.e. for every j ≥ 0.3728596.)
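The first step down from cubic is Strassen's algorithm, which trades the 8 recursive block products of the naive divide-and-conquer scheme for 7. A minimal sketch, assuming numpy and square power-of-two matrix sizes for simplicity:

```python
import numpy as np

def strassen(A, B):
    """Strassen's sub-cubic matrix multiply: 7 recursive products
    instead of 8, giving O(n^log2(7)) ~= O(n^2.807)."""
    n = A.shape[0]
    if n <= 64:                      # cut over to the naive product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The asymptotically faster algorithms behind the 2.3728596 bound build on the same recursive idea with far more elaborate bilinear identities, though their constant factors make them impractical at real-world sizes.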
This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the big-O runtime of this? Does it shift the known best bounds?

And the diagrams are chaotic and don't really explain anything about why this approach is fast or good. The result is that I'm reluctant to even click through to the PDF.

If you want to improve the project's credibility, please consider being honest and open about what is actually going on, giving clear explanations and illustrations rather than material that may as well be designed to hype people too busy to tell you that you are cranks. It's hard to tell whether this is incredibly groundbreaking or just a nothingburger. Sadly, I almost feel like that must be an intentional decision motivated by the poor merits of the work and a desire to exploit AI hype. The alternative, which I prefer to believe is the case, is that the author simply needs to revise and better contextualize.