Just to make sure I'm understanding this correctly.<p>This paper is saying that the authors have found a way to run Llama 2 70B with roughly 1/8th the VRAM required by the original model, right?<p>And the output is on par with the original on some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?
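For anyone wanting to sanity-check the 1/8 figure, here's a rough back-of-the-envelope sketch. The ~2.1 bits/weight (to account for scale/codebook overhead) is my own assumption, not a number taken from the paper:

```python
# Rough back-of-the-envelope check (my numbers, not the paper's): fp16 stores
# each weight in 16 bits, so ~2 bits per weight is about 1/8 of the footprint.
params = 70e9                      # Llama 2 70B parameter count

fp16_gb = params * 16 / 8 / 1e9    # 16 bits/weight -> ~140 GB
q2_gb = params * 2.1 / 8 / 1e9     # ~2.1 bits/weight (assumed overhead for scales) -> ~18 GB

print(f"fp16 ~{fp16_gb:.0f} GB, 2-bit ~{q2_gb:.0f} GB, ratio ~{fp16_gb / q2_gb:.1f}x")
```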
Already works on oobabooga as of a few days ago: <a href="https://github.com/oobabooga/text-generation-webui/issues/4799">https://github.com/oobabooga/text-generation-webui/issues/47...</a><p>Need a few extra steps: <a href="https://github.com/oobabooga/text-generation-webui/pull/4803">https://github.com/oobabooga/text-generation-webui/pull/4803</a>
If this quantization method works with smaller models, it would enable running up to 33B models with only 12GB of VRAM.<p>Especially important for democratizing access to Mistral's new MoE model.
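A quick sketch of why that seems plausible (again assuming ~2.1 bits/weight including scale overhead, which is my own estimate; the KV cache and context length will eat into whatever headroom is left):

```python
# Rough fit check for a 33B model on a 12 GB card (numbers are my own assumptions).
params = 33e9
weights_gb = params * 2.1 / 8 / 1e9     # ~8.7 GB for the quantized weights
headroom_gb = 12 - weights_gb           # left for KV cache, activations, CUDA context

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB on a 12 GB card")
```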
For quantization, you should always verify directly on your own intended tasks, rather than trusting that the quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.
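For example, a minimal sanity check is to compare perplexity on text from your own domain between the original and quantized checkpoints. This sketch uses plain Hugging Face transformers; the model names and file path are placeholders, not real repos:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    # Mean cross-entropy over the sample, exponentiated -> perplexity.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

sample = open("my_domain_eval.txt").read()          # text representative of YOUR task
print(perplexity("original-checkpoint", sample))    # placeholder name
print(perplexity("quantized-checkpoint", sample))   # placeholder name
```

A big gap between the two numbers on your own data is exactly the kind of surprise the benchmark tables can hide.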
Since a pixel can have more states than a bit, could you get more storage and compute by using RGBA space for data/compute instead of binary?<p>Maybe a stupid question.
I’m a layperson when it comes to this topic, but does this mean every value in the network is a value from 00 to 11, i.e. 00, 01, 10, and 11?<p>I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything really.<p>Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to 2 bits only?
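Some intuition: each weight is indeed stored as one of only 4 codes, but the codes are decoded back into real-valued numbers through a per-group scale and offset kept in higher precision, so the decoded weights still take many different float values across the network. The toy below is plain round-to-nearest 2-bit quantization, not this paper's actual method (which uses learned codebooks), just to show the mechanics:

```python
import numpy as np

# Toy 2-bit round-to-nearest quantization of one small group of weights.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=16).astype(np.float32)   # a group of 16 weights

lo, hi = w.min(), w.max()
scale = (hi - lo) / 3                                    # 4 levels -> 3 steps
codes = np.clip(np.round((w - lo) / scale), 0, 3).astype(np.uint8)  # each in {0,1,2,3}

w_hat = codes * scale + lo                               # dequantized (approximate) weights
print("max abs error:", np.abs(w - w_hat).max())
```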
How does this 2-bit quantization method compare to HQQ which was posted yesterday?<p><a href="https://news.ycombinator.com/item?id=38563537">https://news.ycombinator.com/item?id=38563537</a>
Can someone answer some CS 101 questions about this, please?<p>I know there are other methods related to matrix factorization, but I’m asking specifically about quantization.<p>Does quantization literally mean the weight matrix floats are being represented using fewer bits than the 64-bit standard?<p>Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren’t CPU registers still 64 bit? Are these floats converted back to 64 bit for math, or is there some clever packing technique where a 64-bit float actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?
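Roughly, and speaking about low-bit quantization in general rather than this paper's specific kernels: model weights normally start as 16- or 32-bit floats (64-bit is rare for neural nets), the low-bit codes get packed several to a byte, and the hardware does not do arithmetic on 2-bit values directly. Inference kernels unpack and dequantize the codes back to fp16/fp32 (or int8) right before the multiply, usually fused into the GPU matmul. A toy sketch of just the packing part:

```python
import numpy as np

# Four 2-bit codes packed into one uint8, then unpacked with shifts and masks.
# Real inference kernels do this unpack + dequant-to-fp16 on the GPU, fused
# into the matmul, so the actual math still runs at fp16/fp32 (or int8).
codes = np.array([3, 0, 2, 1], dtype=np.uint8)           # four 2-bit values

packed = np.uint8(codes[0] | (codes[1] << 2) | (codes[2] << 4) | (codes[3] << 6))
unpacked = np.array([(packed >> (2 * i)) & 0b11 for i in range(4)], dtype=np.uint8)

assert (unpacked == codes).all()
print(bin(int(packed)), unpacked)
```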