
QuIP#: 2-bit Quantization for LLMs

201 points by jasondavies over 1 year ago

12 comments

SeanAnderson over 1 year ago
Just to make sure I'm understanding this correctly.

This paper signals that the authors have found a way to run Llama 2 70B, but with 1/8th the VRAM requirements as compared to the original model, right?

And the output is on-par with the original along some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?
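A rough sanity check of that 1/8th figure (a back-of-envelope sketch assuming the baseline is fp16 weights; it ignores the KV cache, activations, and the small per-group scale/codebook overhead that real 2-bit formats carry):

```python
# Back-of-envelope estimate of weight storage only, assuming a 16-bit
# baseline. Activations, KV cache, and quantization metadata are ignored.
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 2**30

fp16 = weight_gib(70e9, 16)   # Llama 2 70B at fp16
q2 = weight_gib(70e9, 2)      # same model at 2 bits per weight
print(f"fp16 ~ {fp16:.0f} GiB, 2-bit ~ {q2:.0f} GiB ({fp16 / q2:.0f}x smaller)")
# fp16 ~ 130 GiB, 2-bit ~ 16 GiB (8x smaller)
```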
lxe over 1 year ago
Already works on oobabooga as of a few days ago: https://github.com/oobabooga/text-generation-webui/issues/4799

Need a few extra steps: https://github.com/oobabooga/text-generation-webui/pull/4803
tarruda over 1 year ago
If this quantization method works with smaller models, it would enable running up to 33B models with only 12GB of VRAM.

Especially important for democratizing access to Mistral's new MoE model.
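As a quick check on that number (assuming roughly 2 bits per weight and ignoring quantization metadata): 33e9 weights x 2 bits / 8 = about 8.25 GB of weight storage, which would leave a few gigabytes of a 12 GB card for activations and the KV cache.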
bongwater_OS over 1 year ago
One of the best papers I've read in a long time. This could be huge.
karmasimida over 1 year ago
For quantization, you should always verify directly on your own intended tasks rather than trusting that quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.
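One minimal way to run that kind of spot check (a sketch using the Hugging Face transformers API; the model IDs and sample texts below are placeholders you would replace with the actual checkpoints and your own task data):

```python
# Sketch: compare a full-precision and a quantized checkpoint on *your* data
# by measuring average cross-entropy loss over a few task-specific samples.
# Model names below are placeholders, not the actual QuIP# releases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_loss(model_id: str, texts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    model.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt").input_ids
            out = model(ids, labels=ids)          # causal LM loss
            losses.append(out.loss.item())
    return sum(losses) / len(losses)

my_task_samples = ["<examples from the task you actually care about>"]
print("baseline loss: ", avg_loss("baseline-model", my_task_samples))
print("quantized loss:", avg_loss("quantized-model", my_task_samples))
```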
pyinstallwoes over 1 year ago
Since a pixel can have more states than binary, could you have more space and compute by leveraging RGBA-space for data/compute than binary?

Maybe a stupid question.
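For what it's worth, an 8-bit-per-channel RGBA pixel is still just 4 x 8 = 32 bits, so it holds exactly as much information as any other 32 bits (for example, sixteen 2-bit weight codes), rather than more than binary allows. A toy sketch of that packing, purely illustrative:

```python
# An RGBA8 pixel is 4 channels x 8 bits = 32 bits, i.e. room for sixteen
# 2-bit codes. That is the same information content as any other 32 bits.
def pack_codes_into_pixel(codes: list[int]) -> tuple[int, int, int, int]:
    """Pack sixteen 2-bit codes (values 0..3) into one RGBA8 pixel."""
    assert len(codes) == 16 and all(0 <= c <= 3 for c in codes)
    word = 0
    for i, c in enumerate(codes):
        word |= c << (2 * i)
    # Split the 32-bit word into the four 8-bit channels.
    return (word & 0xFF, (word >> 8) & 0xFF, (word >> 16) & 0xFF, (word >> 24) & 0xFF)

print(pack_codes_into_pixel([3, 1, 0, 2] * 4))
```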
saberience over 1 year ago
I'm a layperson when it comes to this topic, but does this mean every value in the network is a value from 00 to 11? I.e. 00, 01, 10, and 11?

I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything really.

Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to 2 bits only?
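A toy illustration of the idea (this is not the actual QuIP# scheme, which uses incoherence processing and lattice codebooks): each weight is stored as one of four 2-bit codes, and a small higher-precision scale per group of weights maps the codes back to real values when the math is done.

```python
import numpy as np

# Toy 2-bit quantization: each weight becomes one of four levels
# {-3, -1, +1, +3}, multiplied by a per-group floating-point scale.
# This illustrates "2 bits per weight"; it is not the QuIP# algorithm.
levels = np.array([-3.0, -1.0, 1.0, 3.0])

def quantize_2bit(w):
    scale = np.abs(w).max() / 3.0                              # one scale per group
    codes = np.abs(w[None, :] / scale - levels[:, None]).argmin(axis=0)
    return codes.astype(np.uint8), scale                       # codes are in {0,1,2,3}

def dequantize_2bit(codes, scale):
    return levels[codes] * scale

w = np.random.randn(8).astype(np.float32)
codes, scale = quantize_2bit(w)
print("2-bit codes:  ", codes)
print("original:     ", np.round(w, 3))
print("reconstructed:", np.round(dequantize_2bit(codes, scale), 3))
```

The forward pass still runs in 16- or 32-bit floats; only the stored weights are 2-bit codes plus a little scale metadata, which is why the model can still produce sensible text.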
skavi over 1 year ago
Can anyone comment on running the 2-bit quantized Llama 70B on consumer cards like the 4090?
omneity over 1 year ago
How does this 2-bit quantization method compare to HQQ which was posted yesterday?

https://news.ycombinator.com/item?id=38563537
DrNosferatu over 1 year ago
Does LM Studio support it?

By the way, what's your favorite easy-to-use LLM front end?
skykooler over 1 year ago
I wonder whether quantization to 1-bit would be functional?
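1-bit (sign) quantization has been explored for smaller networks, e.g. BinaryConnect and XNOR-Net; a minimal sketch of the usual sign-plus-scale formulation (illustrative only, not a claim about how well it holds up for LLMs):

```python
import numpy as np

# Minimal sign (1-bit) quantization: each weight becomes +/- alpha,
# with alpha chosen as the mean absolute value of the group
# (the scaling used in XNOR-Net-style binarization).
def binarize(w):
    alpha = np.abs(w).mean()
    bits = (w >= 0).astype(np.uint8)          # 1 bit per weight
    return bits, alpha

def debinarize(bits, alpha):
    return np.where(bits == 1, alpha, -alpha)

w = np.random.randn(8)
bits, alpha = binarize(w)
print(w.round(3), "->", debinarize(bits, alpha).round(3))
```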
shahbazac over 1 year ago
Can someone answer CS 101 questions about this, please?

I know there are other methods related to matrix factorization, but I'm asking specifically about quantization.

Does quantization literally mean the weight matrix floats are being represented using fewer bits than the 64-bit standard?

Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren't CPU registers still 64-bit? Are these floats converted back to 64-bit for math, or is there some clever packing technique where a 64-bit float actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?
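On the packing question, the usual pattern in low-bit inference (sketched below; this is the general approach, not the specific QuIP# CUDA kernel) is that weights are stored packed, e.g. 32 two-bit codes per 64-bit word, and are decoded back to 16- or 32-bit floats right before the multiply-accumulate, so the registers stay full width and the savings come from memory footprint and bandwidth:

```python
# Sketch of the usual low-bit pattern: weights are *stored* packed
# (here, 32 two-bit codes per 64-bit word) and *decoded* back to a wider
# float type right before the multiply-accumulate.
import numpy as np

levels = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)  # toy 2-bit codebook

def pack(codes) -> int:
    """Pack 32 two-bit codes (values 0..3) into one 64-bit integer."""
    word = 0
    for i, c in enumerate(codes):
        word |= int(c) << (2 * i)
    return word

def unpack(word: int, n: int = 32):
    return np.array([(word >> (2 * i)) & 0b11 for i in range(n)], dtype=np.uint8)

codes = np.random.randint(0, 4, size=32)
word = pack(codes)                    # storage: 64 bits for 32 weights
w = levels[unpack(word)] * 0.07       # decode + per-group scale -> float32
x = np.random.randn(32).astype(np.float32)
print("dot product computed in float:", float(w @ x))
```

On GPUs this decode happens on the fly inside the matmul kernel; CPUs can do something similar with SIMD shifts and masks.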