Addition is all you need for energy-efficient language models

334 points by InvisibleUp, 8 months ago

19 comments

shrubble 8 months ago
I remember that many years ago, when floating point computation was expensive for Intel CPUs to do, there were multiple ways programmers used integer trickery to work around this.

Chuck Moore of Forth fame demonstrated taking the values, say 1.6 multiplied by 4.1, doing all the intermediate calculations with integers (16 * 41), and then formatting the output by putting the decimal point back in the "right place". This worked as long as the range of values was small enough that scaling by 10 stayed within 16-bit integers (below 65536), for instance. For embedded chips where, for example, you have an analog reading with 10 bits of precision to compute on multiple times per second, this worked well.

I also recall talking many years ago with a Microsoft engineer who had worked on the Microsoft Streets and Trips program (https://archive.org/details/3135521376_qq_CD1 for a screenshot). They too had managed to fit what would normally be floating point numbers, and the needed calculations, into some kind of packed integer format with only the precision that was actually needed; it was faster on the CPUs of the day and more easily compressed to fit on the CD-ROM.
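For readers who haven't seen this scaled-integer trick, here is a minimal sketch of the 1.6 × 4.1 example (Python for readability; the scale factor and values are just the ones from the comment):

```python
# Scaled-integer ("fixed point") multiplication: represent 1.6 and 4.1 as
# integers scaled by 10, multiply, then put the decimal point back.
SCALE = 10  # one decimal digit of precision

a = 16  # 1.6 * SCALE
b = 41  # 4.1 * SCALE

product = a * b                      # 656, scaled by SCALE * SCALE = 100
integer_part = product // (SCALE * SCALE)
fraction_part = product % (SCALE * SCALE)

print(f"{integer_part}.{fraction_part:02d}")  # -> 6.56, i.e. 1.6 * 4.1
# On 16-bit hardware this works as long as the scaled product stays below 65536.
```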
visarga 8 months ago
> can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products

If this were about convolutional nets, then optimizing compute would be a much bigger deal. Transformers are lightweight on compute and heavy on memory. The weakest link in the chain is fetching the model weights into the cores. The 95% and 80% energy reductions cited are for the multiplication operations in isolation, not for the entire inference process.
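A rough back-of-the-envelope illustration of the memory-bound argument (all hardware and model numbers below are assumptions chosen for illustration, not measurements):

```python
# Batch-1 decoding: each generated token touches every weight roughly once,
# doing about 2 FLOPs (one multiply-accumulate) per weight read.
params = 7e9           # assumed 7B-parameter model
bytes_per_weight = 1   # assumed 8-bit weights

flops_per_token = 2 * params
bytes_per_token = params * bytes_per_weight
intensity = flops_per_token / bytes_per_token  # ~2 FLOPs per byte read

# An accelerator with (assumed) 300 TFLOP/s of compute and 1 TB/s of memory
# bandwidth only stays busy above ~300 FLOPs per byte, so decoding sits far
# on the memory-bound side: the cores wait on weight fetches, not multiplies.
print(f"arithmetic intensity ~ {intensity:.1f} FLOP/byte (break-even ~ 300)")
```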
tantalor 8 months ago
[2023] GradIEEEnt half decent: The hidden power of imprecise lines

http://tom7.org/grad/murphy2023grad.pdf

Also in video form: https://www.youtube.com/watch?v=Ae9EKCyI1xU
js8 8 months ago
Haven't read it, but isn't this just logarithmic tables in some form?

I am asking not to dismiss it, I genuinely feel I don't understand logarithms on a fundamental level (of logic gates etc.). If multiplication can be replaced with table lookup and addition, then there has to be a circuit that gives you difficult addition and easy multiplication, or any combination of those tradeoffs.
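For what it's worth, the classic log-table idea the comment is gesturing at looks roughly like this; a toy sketch of table-based multiplication in general, not the L-Mul scheme from the paper:

```python
import math

# Multiplication via logarithms: log(a*b) = log(a) + log(b), so with a
# precomputed table a multiply becomes two lookups, an add, and an inverse
# lookup. The coarseness of the table is what introduces error.
RES = 1000  # table entries per unit of log2, purely illustrative

def log_lookup(x: float) -> int:
    return round(math.log2(x) * RES)   # stands in for a table lookup

def exp_lookup(code: int) -> float:
    return 2.0 ** (code / RES)         # stands in for the inverse table

def approx_mul(a: float, b: float) -> float:
    return exp_lookup(log_lookup(a) + log_lookup(b))

print(approx_mul(1.6, 4.1), 1.6 * 4.1)  # close but not exact
```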
cpldcpu 8 months ago
It puzzles me that there does not seem to be a proper derivation and discussion of the error term in the paper. It's all treated indirectly, by way of inference results.
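One can at least probe the error numerically. The sketch below assumes the paper defines L-Mul(x, y) = (1 + x_m + y_m + 2^(-l(m))) * 2^(x_e + y_e) for x = (1 + x_m) * 2^(x_e); if that reading of the offset term is off, the measured numbers change but the approach stands:

```python
import math
import random

def l_mul(x: float, y: float, offset_bits: int = 4) -> float:
    # Decompose positive floats as x = (1 + xm) * 2**xe with 0 <= xm < 1.
    m, e = math.frexp(x)          # x = m * 2**e, 0.5 <= m < 1
    xm, xe = 2 * m - 1, e - 1
    m, e = math.frexp(y)
    ym, ye = 2 * m - 1, e - 1
    # Replace the mantissa product (1 + xm) * (1 + ym) with an addition plus
    # a constant correction term 2**-offset_bits (the assumed l(m) offset).
    return (1 + xm + ym + 2 ** -offset_bits) * 2 ** (xe + ye)

random.seed(0)
errs = []
for _ in range(10_000):
    a, b = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
    errs.append(abs(l_mul(a, b) - a * b) / (a * b))
print(f"mean relative error ~ {sum(errs) / len(errs):.2%}")
```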
pjc50 8 months ago
"We recommend training and hosting L-Mul-based models on devices integrated with specialized architectural designs. Patent pending"

(from a footnote in the method section)
CGamesPlay 8 months ago
I believe this reduces the compute required, but it still uses 8 bits per value, so it does not reduce the memory needed to run inference, and it doesn't particularly make the models more accessible for inference. Is this storage method suitable for training? That could potentially be an interesting application.
ein0p 8 months ago
More than 10x the amount of energy is spent moving bytes around. Compute efficiency is not as big of an issue as people think. It’s just that the compute is in the wrong place now - it needs to be right next to memory cells, bypassing the memory bus, at least in the initial aggregations that go into dot products.
presspot 8 months ago
From my experience, the absolute magicians in fixed point math were the 8-bit and 16-bit video game designers. I was in awe of the optimizations they did. They made it possible to calculate 3D matrix maths in real time, for example, in order to make the first flight simulators and first person shooter games.
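As a concrete illustration of the kind of trick being described, here is a toy Q16.16 fixed-point dot product (Python for readability; the originals were hand-tuned integer assembly, and Q16.16 is just one common convention):

```python
# Q16.16 fixed point: a value v is stored as the integer round(v * 2**16).
FRAC_BITS = 16
ONE = 1 << FRAC_BITS

def to_fixed(v: float) -> int:
    return round(v * ONE)

def fixed_mul(a: int, b: int) -> int:
    # The product of two Q16.16 values carries 32 fractional bits;
    # shift back down to 16. No floating point anywhere.
    return (a * b) >> FRAC_BITS

def dot3(u, v):
    return sum(fixed_mul(a, b) for a, b in zip(u, v))

u = [to_fixed(x) for x in (1.5, -2.25, 0.5)]
v = [to_fixed(x) for x in (4.0, 1.0, 8.0)]
print(dot3(u, v) / ONE)  # 1.5*4.0 + (-2.25)*1.0 + 0.5*8.0 = 7.75
```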
Buttons840 8 months ago
Would using this neural network based on integer addition be faster? The paper does not claim it would be faster, so I'm assuming not?

What about over time? If this L-Mul (the matrix operation based on integer addition) operation proved to be much more energy efficient and became popular, would new hardware be created that was faster?
cpldcpu 8 months ago
Bill Dally from Nvidia introduced a log representation that basically allows a multiplication to be replaced with an add, without loss of accuracy (in contrast to the proposal above).

https://youtu.be/gofI47kfD28?t=2248
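For context, the general shape of a logarithmic number system (a toy sketch of the idea, not the specific format Dally presents in the talk): values are stored as fixed-point logarithms, so multiplication is an exact integer add of the stored codes, and it is addition that becomes the awkward operation.

```python
import math

FRAC_BITS = 8  # precision of the stored log2 code, purely illustrative

def encode(x: float) -> int:
    return round(math.log2(x) * (1 << FRAC_BITS))  # fixed-point log2(x)

def decode(code: int) -> float:
    return 2.0 ** (code / (1 << FRAC_BITS))

# Multiplying two encoded values is just adding their codes: no error is
# added beyond what the encoding itself already introduced.
a, b = encode(1.6), encode(4.1)
print(decode(a + b), 1.6 * 4.1)
```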
scotty79 8 months ago
All You Need is Considered Harmful.
concrete_head 8 months ago
Just to add an alternative addition-based architecture into the mix:

https://www.youtube.com/watch?v=VqXwmVpCyL0
dwrodri 8 months ago
7 years of the same title format is all you need.
md_rumpf 8 months ago
The return of the CPU?!
A4ET8a8uTh0 8 months ago
Uhh.. I hate to be the one to ask this question, but shouldn't we be focused on making LLMs work well first, and then focus on the desired optimizations? Using everyone's car analogy, it is like making sure early cars use a lower amount of coal. It is a fool's errand.
m3kw9 8 months ago
So instead of say 2x3 you go 2+2+2?
ranguna 8 months ago
I've seen this claim a few times over the last couple of years, and I have a pet theory why this isn't explored a lot:

Nvidia funds most research around LLMs, and they also fund other companies that fund other research. If transformers were to use addition and remove all usage of floating point multiplication, there's a good chance the GPU would no longer be needed, or at the least, cheaper ones would be good enough. If that were to happen, no one would need Nvidia anymore and their trillion dollar empire would start to crumble.

University labs get free GPUs from Nvidia -> university labs don't want to do research that would make said GPUs obsolete, because Nvidia won't like that.

If this were true, it would mean that we are stuck on an inefficient research path due to corporate greed. Imagine if this really was the next best thing, and we just don't explore it more because the ruling corporation doesn't want to lose their market cap.

Hopefully I'm wrong.