Sharing our work on model quantization.

- Blog: https://mobiusml.github.io/hqq_blog/
- Code: https://github.com/mobiusml/hqq
- Models: https://huggingface.co/mobiuslabsgmbh/

No calibration data needed, extremely fast, and it works on both language and vision models!

* Why does it matter?
Quantization significantly reduces GPU memory requirements, but it also degrades model quality. Faster and more accurate quantization methods are extremely valuable for the ML community.

* Approach:
We formulate a sparsity-promoting error between the original weights and their dequantized version, and use a Half-Quadratic solver to derive a closed-form solution that is 100x faster than backprop via PyTorch's Autograd. A simplified sketch of the solver follows.
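To illustrate the idea, here is a minimal PyTorch sketch based on the description in the blog post. It optimizes only the zero-point with a fixed scale; the shrinkage operator, p, beta schedule, and iteration count are illustrative assumptions, not the exact settings from the repo:

    import torch

    def shrink_lp(x, beta, p=0.7):
        # Generalized soft-thresholding: approximate proximal step for the |x|^p error term.
        return torch.sign(x) * torch.relu(torch.abs(x) - (torch.abs(x) ** (p - 1)) / beta)

    def optimize_zero_point(W, scale, zero, nbits=2, iters=20, beta=1.0, kappa=1.01, p=0.7):
        # W:     weights reshaped to [num_groups, group_size]
        # scale: per-group scale,      shape [num_groups, 1] (kept fixed here)
        # zero:  per-group zero-point, shape [num_groups, 1] (optimized)
        qmin, qmax = 0, 2 ** nbits - 1
        for _ in range(iters):
            # Quantize / dequantize with the current zero-point.
            W_q = torch.clamp(torch.round(W / scale + zero), qmin, qmax)
            W_r = (W_q - zero) * scale
            # Step 1: sparsity-promoting closed-form update of the error term.
            W_e = shrink_lp(W - W_r, beta, p)
            # Step 2: closed-form per-group update of the zero-point.
            zero = torch.mean(W_q - (W - W_e) / scale, dim=1, keepdim=True)
            beta *= kappa
        return zero

    # Toy usage: 64 groups of 16 weights, 2-bit asymmetric quantization.
    W = torch.randn(64, 16)
    scale = (W.max(dim=1, keepdim=True).values - W.min(dim=1, keepdim=True).values) / 3
    zero = optimize_zero_point(W, scale, -W.min(dim=1, keepdim=True).values / scale)

Since there is no gradient and every step is closed form, the loop only needs a few passes over the weights, which is where the speed-up over Autograd-based optimization comes from.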
* Quantization speed:
~1 minute for Llama2-13B
~4 minutes for Llama2-70B (over 50x faster than GPTQ)

* Findings:
- Larger models quantized to 3-bit/2-bit outperform smaller full-precision models with similar or lower memory requirements.
- Successful 2-bit quantization requires a smaller group-size (e.g., 32 or 16) and compressing both the zero-point and the scaling factor to keep memory usage low (see the bits-per-weight arithmetic at the end of this post).

While we acknowledge our view might be slightly biased, we genuinely believe this work will significantly benefit the open-source machine learning community. The code and models are released under the permissive Apache license.
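To make the metadata overhead behind the 2-bit finding concrete, here is a rough bits-per-weight calculation (illustrative numbers, assuming one fp16 vs. 8-bit zero-point and scale per group; the exact layout in the repo may differ):

    # Effective bits per weight for 2-bit quantization with group_size=16.
    group_size = 16
    weight_bits = 2 * group_size                   # 2-bit weights
    meta_fp16 = 16 + 16                            # fp16 zero-point + fp16 scale per group
    meta_q8 = 8 + 8                                # both compressed to 8-bit

    print((weight_bits + meta_fp16) / group_size)  # 4.0 bits per weight
    print((weight_bits + meta_q8) / group_size)    # 3.0 bits per weight

Without compressing the zero-point and scale, a "2-bit" model at group-size 16 effectively costs 4 bits per weight, which is why both are compressed as well.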