Also see: "Smarter Local LLMs, Lower VRAM Costs – All Without Sacrificing Quality, Thanks to Google's New Quantization-Aware Training (QAT) Optimization"

https://www.hardware-corner.net/smarter-local-llm-lower-vram-20250419/

> According to Google, they've «reduced the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.»
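For anyone wondering what QAT actually changes: instead of quantizing a finished fp16 checkpoint, the forward pass fake-quantizes the weights during (continued) training, so the model learns to tolerate the rounding before the real Q4_0 export happens. A rough PyTorch sketch of that idea, not Google's actual recipe; the group size, symmetric int4 range, and the layer wrapper are my own assumptions:

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    # Group-wise symmetric int4 "fake quantization": round to the int4 grid,
    # then dequantize, so the forward pass sees quantized weights while the
    # straight-through estimator lets gradients still update the fp weights.
    # (Group size and range are illustrative, not Gemma's actual scheme.)
    g = w.reshape(-1, group_size)  # assumes numel % group_size == 0
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    dq = (torch.round(g / scale).clamp(-8, 7) * scale).reshape(w.shape)
    return w + (dq - w).detach()   # straight-through estimator

class QATLinear(torch.nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, fake_quant_int4(self.weight), self.bias)
```

Training with layers like that is what lets the Q4_0 export lose so little perplexity; a post-training quant has to approximate weights that were never trained to be rounded.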
Are there comparisons between the int4 QAT versions of these models and the more common GGUF Q4_K_M quantizations generated post-training? The QAT models appear to be slightly larger:

https://ollama.com/library/gemma3/tags

I presume the QAT versions are better, but I don't see by how much.
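One way to answer that yourself is llama.cpp's perplexity tool (the same evaluation Google cites): run it over the same text file with both GGUFs and compare the reported PPL. A rough sketch; the binary is named `llama-perplexity` in recent llama.cpp builds, and the model/eval filenames here are placeholders to substitute:

```python
import subprocess

# Placeholder paths: swap in the actual QAT Q4_0 and post-training Q4_K_M GGUFs
# and an eval text file (e.g. wikitext-2's wiki.test.raw).
models = {
    "QAT int4 (Q4_0)": "gemma-3-27b-it-qat-q4_0.gguf",
    "post-training Q4_K_M": "gemma-3-27b-it-Q4_K_M.gguf",
}

for label, gguf in models.items():
    proc = subprocess.run(
        ["./llama-perplexity", "-m", gguf, "-f", "wiki.test.raw"],
        capture_output=True, text=True,
    )
    # The final PPL figure is printed near the end of the tool's output.
    tail = (proc.stdout + proc.stderr).strip().splitlines()[-3:]
    print(label)
    print("\n".join(tail))
```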