
AI engineers claim new algorithm reduces AI power consumption by 95%

370 points, by ferriswil, 8 months ago

35 comments

djoldman, 8 months ago
<a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2410.00907" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2410.00907</a><p>ABSTRACT<p>Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm that approximates floating point number multiplication with integer addition operations. The new algorithm costs significantly less computation resource than 8-bit floating point multiplication but achieves higher precision. Compared to 8-bit floating point multiplications, the proposed method achieves higher precision but consumes significantly less bit-level computation. Since multiplying floating point numbers requires substantially higher energy compared to integer addition operations, applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products. We calculated the theoretical error expectation of L-Mul, and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with 4-bit mantissa achieves comparable precision as float8 e4m3 multiplications, and L-Mul with 3-bit mantissa outperforms float8 e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves equivalent precision as using float8 e4m3 as accumulation precision in both fine-tuning and inference.
jart, 8 months ago
It's a very crude approximation, e.g. 1.75 * 2.5 == 3 (although it seems better as the numbers get closer to 0).

I tried implementing this for AVX512 with tinyBLAS in llamafile.

    inline __m512 lmul512(__m512 x, __m512 y) {
        // field masks and exponent bias for IEEE-754 single precision
        __m512i sign_mask = _mm512_set1_epi32(0x80000000);
        __m512i exp_mask = _mm512_set1_epi32(0x7F800000);
        __m512i mant_mask = _mm512_set1_epi32(0x007FFFFF);
        __m512i exp_bias = _mm512_set1_epi32(127);
        // split both operands into sign, exponent, and mantissa fields
        __m512i x_bits = _mm512_castps_si512(x);
        __m512i y_bits = _mm512_castps_si512(y);
        __m512i sign_x = _mm512_and_si512(x_bits, sign_mask);
        __m512i sign_y = _mm512_and_si512(y_bits, sign_mask);
        __m512i exp_x = _mm512_srli_epi32(_mm512_and_si512(x_bits, exp_mask), 23);
        __m512i exp_y = _mm512_srli_epi32(_mm512_and_si512(y_bits, exp_mask), 23);
        __m512i mant_x = _mm512_and_si512(x_bits, mant_mask);
        __m512i mant_y = _mm512_and_si512(y_bits, mant_mask);
        // sign of the product, sum of exponents minus one bias,
        // and the average of the two mantissas (the crude part)
        __m512i sign_result = _mm512_xor_si512(sign_x, sign_y);
        __m512i exp_result = _mm512_sub_epi32(_mm512_add_epi32(exp_x, exp_y), exp_bias);
        __m512i mant_result = _mm512_srli_epi32(_mm512_add_epi32(mant_x, mant_y), 1);
        // reassemble the approximate product
        __m512i result_bits = _mm512_or_si512(
            _mm512_or_si512(sign_result, _mm512_slli_epi32(exp_result, 23)),
            mant_result);
        return _mm512_castsi512_ps(result_bits);
    }

Then I used it for Llama-3.2-3B-Instruct.F16.gguf and it outputted gibberish. So you would probably have to train and design your model specifically to use this multiplication approximation in order for it to work. Or maybe I'd have to tune the model so that only certain layers and/or operations use the approximation. However the speed was decent. Prefill only dropped from 850 tokens per second to 200 tok/sec on my Threadripper. Prediction speed was totally unaffected, staying at 34 tok/sec. I like how the code above generates vpternlog ops. So if anyone ever designs an LLM architecture and releases weights on Hugging Face that use this algorithm, we'll be able to run them reasonably fast without special hardware.
kayo_20211030, 8 months ago
Extraordinary claims require extraordinary evidence. Maybe it's possible, but consider that some really smart people, in many different groups, have been working diligently in this space for quite a while; so claims of 95% savings on energy costs _with equivalent performance_ are in the extraordinary category. Of course, we'll see when the tide goes out.
jhj, 8 months ago
As someone who has worked in this space (approximate compute) on both GPUs and in silicon in my research, the power consumption claims are completely bogus, as are the accuracy claims:

> In this section, we show that L-Mul is more precise than fp8 e4m3 multiplications

> To be concise, we do not consider the rounding to nearest even mode in both error analysis and complexity estimation for both Mul and L-Mul

These two statements together are nonsensical. Sure, if you analyze accuracy while ignoring the part of the algorithm that gives you accuracy in the baseline, you can derive whatever cherry-picked result you want.

The multiplication of two floating point values, if you round to nearest even, will be the correctly rounded result of multiplying the original values at infinite precision; this is how floating point rounding usually works and what IEEE 754 mandates for fundamental operations (e.g., multiplication here) if you choose to follow those guidelines. But not rounding to nearest even will result in a lot more quantization noise, and biased noise at that too.

> applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products

A good chunk of the energy cost is simply moving data between memories (especially external DRAM/HBM/whatever) and along wires, buffering values in SRAMs and flip-flops and the like. Combinational logic cost is usually not a big deal. While having a ton of fixed-function matrix multipliers does raise the cost of combinational logic quite a bit, at most what they have will probably cut the power of an overall accelerator by 10-20% or so.

> In this section, we demonstrate that L-Mul can replace tensor multiplications in the attention mechanism without any loss of performance, whereas using fp8 multiplications for the same purpose degrades inference accuracy

I may have missed it in the paper, but they have provided no details on (re)scaling and/or using higher precision accumulation for intermediate results, as one would experience on an H100 for instance. Without this information, I don't trust these evaluation results either.
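To make the rounding point concrete, here is a small self-contained sketch (my illustration, not from the paper): it quantizes a float's mantissa to 4 bits by plain truncation and by round-to-nearest-even, then compares the average signed error over a sweep of values in [1, 2). Truncation's mean error is clearly biased toward zero, while round-to-nearest-even's stays near zero.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Keep `keep` mantissa bits of x, either by truncation (rne == 0) or by
     * round-to-nearest-even (rne == 1). Normal positive floats only. */
    static float quantize(float x, int keep, int rne) {
        uint32_t b;
        memcpy(&b, &x, sizeof b);
        uint32_t drop = 23u - (uint32_t)keep;      /* mantissa bits to discard */
        if (rne) {
            uint32_t half = 1u << (drop - 1);
            uint32_t lsb  = (b >> drop) & 1u;      /* LSB of the kept field, for ties */
            b += half - 1u + lsb;                  /* standard round-half-to-even trick */
        }
        b &= ~((1u << drop) - 1u);                 /* clear the dropped bits */
        float r;
        memcpy(&r, &b, sizeof r);
        return r;
    }

    int main(void) {
        double err_trunc = 0.0, err_rne = 0.0;
        int n = 0;
        for (float x = 1.0f; x < 2.0f; x += 1.0f / 4096.0f, ++n) {
            err_trunc += quantize(x, 4, 0) - x;    /* always <= 0: biased toward zero  */
            err_rne   += quantize(x, 4, 1) - x;    /* roughly symmetric around zero    */
        }
        printf("mean error: truncation %+g, round-to-nearest-even %+g\n",
               err_trunc / n, err_rne / n);
        return 0;
    }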
_aavaa_, 8 months ago
Original discussion of the preprint: https://news.ycombinator.com/item?id=41784591
remexre, 8 months ago
Isn't this just taking advantage of "log(x) + log(y) = log(xy)"? The IEEE 754 floating-point representation stores floats as sign, mantissa, and exponent -- ignore the first two (you quantized anyway, right?), and the exponent is just an integer storing log() of the float.
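A tiny sketch of that point (my illustration, not from the paper): the unbiased exponent field of a normal IEEE-754 float is floor(log2(x)), so adding exponent fields adds logarithms, which is most of what multiplying does.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Unbiased exponent of a normal positive float: essentially floor(log2(x)). */
    static int32_t exp_field(float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        return (int32_t)((bits >> 23) & 0xFF) - 127;
    }

    int main(void) {
        float x = 6.0f, y = 20.0f;
        /* Prints "2 + 4 vs 6": the exponent of x*y equals the sum of the exponents
         * here (it can be one higher when the mantissa product carries past 2). */
        printf("%d + %d vs %d\n", exp_field(x), exp_field(y), exp_field(x * y));
        return 0;
    }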
robomartin, 8 months ago
I posted this about a week ago:

https://news.ycombinator.com/item?id=41816598

This has been done for decades in digital circuits, FPGAs, Digital Signal Processing, etc. Floating point is both resource and power intensive, and using FP without dedicated FP hardware is something that has been avoided and done without for decades unless absolutely necessary.
didgetmaster, 8 months ago
Maybe I am just a natural skeptic, but whenever I see a headline that says 'method x reduces y by z%' while the text instead says that optimizing some step 'could potentially reduce y by up to z%', I am suspicious.

Why not publish some actual benchmarks that prove your claim in even a few special cases?
GistNoesis, 8 months ago
Does https://en.wikipedia.org/wiki/Jevons_paradox apply in this case?
holoduke, 8 months ago
I don't think algorithms will change energy consumption. Whatever maximum computing capacity exists will always be needed. If tomorrow a new algorithm increases performance 4 times, we will just do 4 times more computing.
Art9681, 8 months ago
In the end, the reduced power consumption means the current models that are "good enough" will fit a much smaller compute budget, such as edge devices. However, enthusiasts are still going to want the best hardware they can afford because, inevitably, everyone will want to maximize the size and intelligence of the model they can run. So we're just going to scale. This might bring a GPT-4 level model to edge devices, but we are still going to want to run what might resemble a GPT-5/6 model on the best hardware possible at the time. So don't throw away your GPUs yet. This will bring capabilities to the mass market, but your high-end GPU will still scale the solution n-fold, and you'll be able to run models with disregard for the energy savings promoted in the headline.

In other sensationalized words: "AI engineers can claim new algorithm allows them to fit GPT-5 in an RTX 5090 running at 600 watts."
gcanyon, 8 months ago
This isn't really the optimization I'm thinking about, but: given the weird and abstract nature of how ML in general and LLMs in particular work, it seems reasonable to think that there might be algorithms that achieve the same, or a similar, result in an orders-of-magnitude more efficient way.
greenthrow, 8 months ago
The trend of hyping up papers this early erodes people's faith in science, thanks to poor journalism that fails to explain the results are theoretical. The outlets that do this should pay a price, but they don't, because almost every outlet does it.
panosv, 8 months ago
Lemurian Labs looks like it's doing something similar: https://www.lemurianlabs.com/technology. They use the Logarithmic Number System (LNS).
ein0p, 8 months ago
As a rule, compute only takes less than 10% of all energy. 90% is data movement.
idiliv, 8 months ago
Duplicate, posted on October 9: https://news.ycombinator.com/item?id=41784591
hello_computer, 8 months ago
How does this differ from Cussen & Ullman?

https://arxiv.org/abs/2307.01415
littlestymaar, 8 months ago
Related: https://news.ycombinator.com/item?id=41784591 (10 days ago)
andrewstuart, 8 months ago
Here is the Microsoft implementation:

https://github.com/microsoft/BitNet
syntaxing, 8 months ago
I’m looking forward to Bitnet adaptation. MS just released a tool for it similar to llamacpp. Really hoping major models get retrained for it.
creativenolo, 8 months ago
Simple question: if true, would power consumption stay at 100% because we'd just work the algorithm harder?

I had assumed latency etc. were driven by what was desirable for the use case and hardware, rather than by power consumption.
asicsarecool, 8 months ago
Don't assume this isn't already in place at the main AI companies.
svilen_dobrev, 8 months ago
I am not well versed in the math involved, but IMO if the outcome depends mostly on the differences between the numbers, as a smaller-or-bigger distinction as well as their magnitudes, then exactness might not be needed. I mean, as long as the approximate "function" looks similar to the exact one, that might be good enough.

Maybe even generate a table of the approximate results and use that, in various stages? Like the way sin/cos was done 30 years ago before FP coprocessors arrived.
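For reference, a minimal sketch of that old-school table approach (my illustration, with an arbitrary 256-entry resolution), the way fixed-point code handled sin/cos before FP coprocessors were common:

    #include <math.h>
    #include <stdint.h>

    /* 256-entry sine table indexed by an 8-bit phase (0..255 maps to 0..2*pi). */
    static float sin_lut[256];

    void build_sin_lut(void) {
        for (int i = 0; i < 256; ++i)
            sin_lut[i] = sinf((float)i * (6.2831853f / 256.0f));
    }

    /* Table lookup instead of computing sin(); error is bounded by the table step. */
    static inline float fast_sin(uint8_t phase) {
        return sin_lut[phase];
    }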
m463, 8 months ago
So couldn't you design a GPU that uses or supports this algorithm to use the same power, but use bigger models, better models, or do more work?
DennisL123, 8 months ago
This is a result on 8 bit numbers, right? Why not precompute all 64k possible combinations and look up the results from the table?
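That would look something like the sketch below (my illustration, under assumed details: an fp8 e4m3 format with bias 7, special values ignored). All 256 x 256 products are precomputed once, and each multiply becomes a single table load.

    #include <stdint.h>
    #include <math.h>

    /* 65,536 precomputed products (256 KiB when stored as float). */
    static float mul_lut[256][256];

    /* Decode an fp8 e4m3 code (1 sign, 4 exponent, 3 mantissa bits, bias 7).
     * NaN handling is omitted for brevity. */
    static float decode_e4m3(uint8_t c) {
        int s = c >> 7, e = (c >> 3) & 0xF, m = c & 0x7;
        float mag = e ? (1.0f + m / 8.0f) * exp2f((float)(e - 7))
                      : (m / 8.0f) * exp2f(-6.0f);   /* subnormals */
        return s ? -mag : mag;
    }

    void build_mul_lut(void) {
        for (int a = 0; a < 256; ++a)
            for (int b = 0; b < 256; ++b)
                mul_lut[a][b] = decode_e4m3((uint8_t)a) * decode_e4m3((uint8_t)b);
    }

    /* Exact fp8 x fp8 product with no multiplier at runtime: one load. */
    static inline float mul8(uint8_t a, uint8_t b) {
        return mul_lut[a][b];
    }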
andrewstuart, 8 months ago
The ultimate "you're doing it wrong."

For the sake of the climate and environment, it would be nice if it were true.

Bad news for Nvidia. "Sell your stock" bad.

Does it come with a demonstration?
faragon, 8 months ago
Before reading the article, I was expecting 1-bit values instead of bfloats, and logical operators instead of arithmetic.
DrNosferatu, 8 months ago
Why don't they implement the algorithm in an FPGA to compare it with a classical baseline?
Wheatman, 8 months ago
Isn't 90% of the energy spent moving bytes around? Why would this have such a great effect?
m3kw9, 8 months ago
This sounds similar to someone saying a room-temperature superconductor was discovered.
tartakovsky, 8 months ago
original paper: https://news.ycombinator.com/item?id=41784591
DesiLurker, 8 months ago
Validity of the claim aside, why don't they say it reduces energy by 20x instead of by 95%? A ratio gives a much better perspective when the remaining fraction is tiny.
nprateem, 8 months ago
Is it the one where you delete 95% of user accounts?
neuroelectron, 8 months ago
Nobody is interested in this because nobody wants less capex.
quantadev, 8 months ago
I wonder if someone has fed this entire "problem" into the latest ChatGPT o1 (the new model with reasoning capability), just giving it all the code for a Multilayer Perceptron and then the task/prompt of finding ways to implement the same network using only integer operations.

Surely even the OpenAI devs must have done this the minute they got done training that model, right? I wonder if they'd even admit it was an AI that came up with the solution rather than just publishing it and taking credit. haha.