Just to make sure I'm understanding this correctly.<p>This paper is saying that the authors have found a way to run Llama 2 70B with roughly 1/8th the VRAM required by the original model, right?<p>And the output is on par with the original on some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?
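For anyone wanting to sanity-check the 1/8 figure, here's a rough back-of-the-envelope sketch. The ~2.1 bits/weight (to account for scale/codebook overhead) is my own assumption, not a number taken from the paper:

```python
# Rough back-of-the-envelope check (my numbers, not the paper's): fp16 stores
# each weight in 16 bits, so ~2 bits per weight is about 1/8 of the footprint.
params = 70e9                      # Llama 2 70B parameter count

fp16_gb = params * 16 / 8 / 1e9    # 16 bits/weight -> ~140 GB
q2_gb = params * 2.1 / 8 / 1e9     # ~2.1 bits/weight (assumed overhead for scales) -> ~18 GB

print(f"fp16 ~{fp16_gb:.0f} GB, 2-bit ~{q2_gb:.0f} GB, ratio ~{fp16_gb / q2_gb:.1f}x")
```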
Already works on oobabooga as of a few days ago: <a href="https://github.com/oobabooga/text-generation-webui/issues/4799">https://github.com/oobabooga/text-generation-webui/issues/47...</a><p>Need a few extra steps: <a href="https://github.com/oobabooga/text-generation-webui/pull/4803">https://github.com/oobabooga/text-generation-webui/pull/4803</a>
If this quantization method works with smaller models, it would enable running up to 33B models with only 12GB of VRAM.<p>Especially important for democratizing access to Mistral's new MoE model.
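A quick sketch of why that seems plausible (again assuming ~2.1 bits/weight including scale overhead, which is my own estimate; the KV cache and context length will eat into whatever headroom is left):

```python
# Rough fit check for a 33B model on a 12 GB card (numbers are my own assumptions).
params = 33e9
weights_gb = params * 2.1 / 8 / 1e9     # ~8.7 GB for the quantized weights
headroom_gb = 12 - weights_gb           # left for KV cache, activations, CUDA context

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB on a 12 GB card")
```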
For quantization, you should always verify directly on your own intended tasks, rather than trusting that the quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.
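For example, a minimal sanity check is to compare perplexity on text from your own domain between the original and quantized checkpoints. This sketch uses plain Hugging Face transformers; the model names and file path are placeholders, not real repos:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    # Mean cross-entropy over the sample, exponentiated -> perplexity.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

sample = open("my_domain_eval.txt").read()          # text representative of YOUR task
print(perplexity("original-checkpoint", sample))    # placeholder name
print(perplexity("quantized-checkpoint", sample))   # placeholder name
```

A big gap between the two numbers on your own data is exactly the kind of surprise the benchmark tables can hide.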
Since a pixel can have more states than a bit, could you get more storage and compute by using RGBA space for data/compute instead of binary?<p>Maybe a stupid question.
I’m a layperson when it comes to this topic, but does this mean every value in the network is a value from 00 to 11, i.e. 00, 01, 10, and 11?<p>I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything really.<p>Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to 2 bits only?
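Some intuition: each weight is indeed stored as one of only 4 codes, but the codes are decoded back into real-valued numbers through a per-group scale and offset kept in higher precision, so the decoded weights still take many different float values across the network. The toy below is plain round-to-nearest 2-bit quantization, not this paper's actual method (which uses learned codebooks), just to show the mechanics:

```python
import numpy as np

# Toy 2-bit round-to-nearest quantization of one small group of weights.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=16).astype(np.float32)   # a group of 16 weights

lo, hi = w.min(), w.max()
scale = (hi - lo) / 3                                    # 4 levels -> 3 steps
codes = np.clip(np.round((w - lo) / scale), 0, 3).astype(np.uint8)  # each in {0,1,2,3}

w_hat = codes * scale + lo                               # dequantized (approximate) weights
print("max abs error:", np.abs(w - w_hat).max())
```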
How does this 2-bit quantization method compare to HQQ which was posted yesterday?<p><a href="https://news.ycombinator.com/item?id=38563537">https://news.ycombinator.com/item?id=38563537</a>
Can someone answer some CS 101 questions about this, please?<p>I know there are other methods related to matrix factorization, but I’m asking specifically about quantization.<p>Does quantization literally mean the weight matrix floats are being represented using fewer bits than the 64-bit standard?<p>Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren’t CPU registers still 64 bit? Are these floats converted back to 64 bit for math, or is there some clever packing technique where a 64-bit float actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?
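Roughly, and speaking about low-bit quantization in general rather than this paper's specific kernels: model weights normally start as 16- or 32-bit floats (64-bit is rare for neural nets), the low-bit codes get packed several to a byte, and the hardware does not do arithmetic on 2-bit values directly. Inference kernels unpack and dequantize the codes back to fp16/fp32 (or int8) right before the multiply, usually fused into the GPU matmul. A toy sketch of just the packing part:

```python
import numpy as np

# Four 2-bit codes packed into one uint8, then unpacked with shifts and masks.
# Real inference kernels do this unpack + dequant-to-fp16 on the GPU, fused
# into the matmul, so the actual math still runs at fp16/fp32 (or int8).
codes = np.array([3, 0, 2, 1], dtype=np.uint8)           # four 2-bit values

packed = np.uint8(codes[0] | (codes[1] << 2) | (codes[2] << 4) | (codes[3] << 6))
unpacked = np.array([(packed >> (2 * i)) & 0b11 for i in range(4)], dtype=np.uint8)

assert (unpacked == codes).all()
print(bin(int(packed)), unpacked)
```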