
How to evaluate performance of LLM inference frameworks

18 points by matt_d 9 months ago

2 comments

Havoc 9 months ago
> Aggressively pruning LLMs via quantization can significantly reduce their accuracy and you might be better off using a smaller model in the first place.

Not sure that is correct. Quantization charts suggest it's a fairly continuous spectrum, i.e. an aggressively quantized 13B ends up about the same as an unquantized 7B:

https://www.researchgate.net/figure/Performance-degradation-of-quantized-models-Chart-available-at_fig1_377817624
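A minimal sketch of the comparison this comment describes: measuring perplexity of an aggressively quantized larger model against a smaller unquantized one on the same held-out text. The model names, the 4-bit bitsandbytes config, and the evaluation file are illustrative assumptions, not anything specified in the thread.

```python
# Sketch: compare a 4-bit-quantized 13B model against an fp16 7B model on
# perplexity, to see whether aggressive quantization lands near a smaller
# unquantized model. Model names and eval corpus are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def perplexity(model_name, quant_config, text, max_len=1024):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    ids = tok(text, return_tensors="pt").input_ids[:, :max_len].to(model.device)
    with torch.no_grad():
        # Labels equal to inputs -> average next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

text = open("eval_corpus.txt").read()           # any held-out text sample
quant4 = BitsAndBytesConfig(load_in_4bit=True)  # aggressive 4-bit quantization

print("13B @ 4-bit:", perplexity("meta-llama/Llama-2-13b-hf", quant4, text))
print(" 7B @ fp16 :", perplexity("meta-llama/Llama-2-7b-hf", None, text))
```

A single perplexity number on one corpus is only a rough proxy; the chart linked above plots degradation across quantization levels, which is the fuller picture.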
brrrrrm 9 months ago
If you're hitting a memory wall it means you're not scaling. This stuff really doesn't apply to scaled-up inference but rather local small-batch execution.
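To illustrate the point about the memory wall, here is a back-of-the-envelope roofline for a single decode step: at batch size 1 the step is dominated by streaming the weights from HBM, and only at large batch sizes does compute catch up. The hardware numbers are rough A100-class figures chosen purely for illustration, and KV-cache traffic is ignored.

```python
# Rough roofline for one decode step of a dense transformer, showing why
# small-batch local inference is memory-bandwidth bound while large-batch
# serving becomes compute bound. Numbers are illustrative, not measured.
PARAMS = 13e9            # model parameters
BYTES_PER_PARAM = 2      # fp16 weights
PEAK_FLOPS = 312e12      # ~A100 fp16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12         # ~A100 HBM bandwidth, bytes/s

def decode_step_time(batch):
    # Each decode step reads every weight once (shared across the batch)
    # and does roughly 2 * params FLOPs per sequence in the batch.
    # KV-cache reads are ignored in this simplification.
    compute_s = 2 * PARAMS * batch / PEAK_FLOPS
    memory_s = PARAMS * BYTES_PER_PARAM / PEAK_BW
    bound = "memory-bound" if memory_s > compute_s else "compute-bound"
    return max(compute_s, memory_s), bound

for b in (1, 8, 64, 256):
    t, bound = decode_step_time(b)
    print(f"batch {b:>3}: ~{t * 1e3:.2f} ms/step ({bound})")
```

With these numbers the weight-streaming time (~13 ms) dominates until the batch reaches the low hundreds, which is the regime split the comment is pointing at.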