> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)

Well, there is your problem.

LLaMA quantized to 4 bits fits in 40 GB. And it gets similar throughput split between dual consumer GPUs, which likely means much better throughput on a single 40 GB A100 (or a cheaper 48 GB pro GPU):

https://github.com/turboderp/exllama#dual-gpu-results

And this is without any consideration of batching (which, to be honest, I'm not familiar with).

Also, I'm not sure which model was tested, but Llama 70B chat should perform better than the base model if the prompting syntax is right. That syntax was only recently reverse engineered from Meta's demo implementation.

There are other perks from Llama too, like manually adjusting various generation parameters, using a custom grammar during generation, and extended context.
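
To make the memory point concrete, here is a minimal sketch of loading a Llama model in 4-bit precision on a single 40-48 GB GPU, using the transformers + bitsandbytes path rather than exllama's GPTQ path (same idea, different quantization scheme). The model name and settings are illustrative, not a benchmark config:

    # Sketch: 4-bit quantized load so the model fits on one 40-48 GB GPU
    # instead of two 80 GB A100s. Assumes you have access to the weights.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-70b-chat-hf"  # illustrative

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # place layers automatically on the available GPU(s)
    )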
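
And on the prompting syntax: the chat format reverse engineered from Meta's demo looks roughly like the sketch below for a single turn (the helper function is my own, not part of any library). Getting these special tokens wrong is a common reason the chat model appears to underperform the base model:

    # Sketch of the Llama 2 chat prompt format (single turn, with system prompt).
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    def build_prompt(system_prompt: str, user_message: str) -> str:
        # Hypothetical helper; mirrors the format in Meta's reference chat code.
        return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message} {E_INST}"

    print(build_prompt("You are a helpful assistant.", "Summarize this article."))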