QuIP#: 2-bit Quantization for LLMs

201 points, posted by jasondavies over 1 year ago

12 comments

SeanAnderson, over 1 year ago

Just to make sure I'm understanding this correctly: this paper signals that the authors have found a way to run Llama 2 70B with 1/8th the VRAM requirements of the original model, right?

And the output is on par with the original on some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?

lxe, over 1 year ago

Already works on oobabooga as of a few days ago: https://github.com/oobabooga/text-generation-webui/issues/4799

Needs a few extra steps: https://github.com/oobabooga/text-generation-webui/pull/4803

tarruda, over 1 year ago

If this quantization method works with smaller models, it would enable running up to 33B models with only 12GB of VRAM.

Especially important for democratizing access to Mistral's new MoE model.

bongwater_OS, over 1 year ago

One of the best papers I've read in a long time. This could be huge.

karmasimida, over 1 year ago

For quantization, you should always verify directly on your own intended tasks rather than trusting that the quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.

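A minimal sketch of that kind of spot check, assuming a Hugging Face transformers setup (plus accelerate for `device_map`) and placeholder model paths; a 2-bit QuIP# checkpoint will likely need the authors' own loading code rather than a plain `from_pretrained`, so treat this as the shape of the check rather than a drop-in script:

```python
# Sketch: compare average loss of a full-precision model and a quantized variant
# on your own task data. Model identifiers below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_loss(model_name: str, texts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids.to(model.device)
            losses.append(model(ids, labels=ids).loss.item())  # causal-LM cross-entropy
    return sum(losses) / len(losses)

my_task_texts = ["samples drawn from the task you actually care about"]
print("baseline :", avg_loss("meta-llama/Llama-2-7b-hf", my_task_texts))
print("quantized:", avg_loss("path/to/quantized-checkpoint", my_task_texts))
```
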
pyinstallwoes, over 1 year ago

Since a pixel can have more states than binary, could you get more space and compute by leveraging RGBA-space for data/compute than binary?

Maybe a stupid question.

saberience, over 1 year ago

I'm a layperson when it comes to this topic, but does this mean every value in the network is a value from 00 to 11, i.e. 00, 01, 10, or 11?

I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything, really.

Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to only 2 bits?

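For intuition, here is a minimal sketch of plain round-to-nearest 2-bit quantization with one shared scale per group of weights; this is far cruder than what the paper actually does (QuIP builds on incoherence processing), but it shows how four levels per weight plus a higher-precision scale can still approximate real-valued weights:

```python
# Illustration of "2 bits per weight": each weight becomes one of 4 codes,
# plus a shared higher-precision scale per group. Not the QuIP# algorithm.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8).astype(np.float32)        # a group of 8 original weights

levels = np.array([-1.5, -0.5, 0.5, 1.5])        # the 4 representable values (2 bits)
scale = np.abs(w).max() / np.abs(levels).max()   # one float scale for the whole group

codes = np.abs(levels * scale - w[:, None]).argmin(axis=1)  # each code is in {0,1,2,3}
w_hat = levels[codes] * scale                    # dequantized approximation

print("original :", np.round(w, 3))
print("codes    :", codes)
print("recovered:", np.round(w_hat, 3))
```

No single weight carries much information on its own; the model's behaviour comes from billions of such coarsely represented weights acting together, with the scales (and, in QuIP#, the codebook structure) doing the rest.
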
skavi, over 1 year ago

Can anyone comment on running the 2-bit quantized Llama 70B on consumer cards like the 4090?

omneity, over 1 year ago

How does this 2-bit quantization method compare to HQQ, which was posted yesterday?

https://news.ycombinator.com/item?id=38563537

DrNosferatu, over 1 year ago

Does LM Studio support it?

By the way, what's your favorite easy-to-use LLM front end?

skykooler, over 1 year ago

I wonder whether quantization to 1 bit would still be functional?

shahbazac, over 1 year ago

Can someone answer some CS 101 questions about this, please?

I know there are other methods related to matrix factorization, but I'm asking specifically about quantization.

Does quantization literally mean the weight-matrix floats are being represented using fewer bits than the 64-bit standard?

Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren't CPU registers still 64 bit? Are these floats converted back to 64 bit for math, or is there some clever packing technique where a 64-bit float actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?

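On the packing question, a common scheme (sketched below with NumPy, not tied to any particular library's format) stores four 2-bit codes per byte and unpacks them to a wider float type right before the matrix multiply, so the arithmetic itself still runs at 16 or 32 bits in hardware registers. Note also that LLM weights are normally stored as 16-bit or 32-bit floats rather than 64-bit, and these models usually run on GPUs; the win from quantization is memory footprint and bandwidth, not narrower arithmetic.

```python
# Sketch of bit-packing: four 2-bit codes per uint8, unpacked and dequantized
# to float32 just before the actual math. Not any specific library's format.
import numpy as np

codes = np.array([3, 0, 2, 1, 1, 3, 0, 2], dtype=np.uint8)    # values in {0..3}

# Pack: 4 codes per byte (code i occupies bits 2*i .. 2*i+1 of its byte).
packed = np.zeros(len(codes) // 4, dtype=np.uint8)
for i, c in enumerate(codes):
    packed[i // 4] |= int(c) << (2 * (i % 4))

# Unpack the 2-bit codes again.
unpacked = np.array(
    [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(len(codes))],
    dtype=np.uint8,
)
assert np.array_equal(codes, unpacked)

# Dequantize to float32 for the matmul: shared codebook plus a per-group scale.
levels = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)
scale = np.float32(0.07)
weights = levels[unpacked] * scale
print(packed, weights)
```
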