Hey HN!

We've had a lot of success with quantized LLMs for inference speed and cost, since they fit on smaller GPUs (NVIDIA T4, NVIDIA K80, RTX 4070, etc.). There's no need for everyone to run the quantization themselves, so we quantized Llama 3 8B Instruct to 8 bits with GPTQ and figured we'd share it with the community. Excited to see what everyone does with it!
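
For anyone curious what the quantization step roughly looks like, here's a minimal sketch using the Hugging Face transformers GPTQ integration (needs optimum and auto-gptq installed). The model ID, calibration dataset, and output path are placeholders for illustration, not necessarily exactly what we ran:

    # Rough sketch of 8-bit GPTQ quantization via Hugging Face transformers.
    # Assumes `pip install transformers optimum auto-gptq` and enough GPU
    # memory to hold the fp16 model during the calibration pass.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo; request access first

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # GPTQ needs a small calibration set; "c4" is a common default,
    # not necessarily the one we used.
    quant_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

    # Passing a quantization_config triggers the GPTQ calibration pass at load.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )

    # Save the quantized weights so others can load them directly.
    model.save_pretrained("llama-3-8b-instruct-gptq-8bit")
    tokenizer.save_pretrained("llama-3-8b-instruct-gptq-8bit")

Once the quantized checkpoint is published, using it should just be a plain AutoModelForCausalLM.from_pretrained(...) on the quantized repo, with no calibration step needed.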