Gemma 3 QAT Models: Bringing AI to Consumer GPUs

602 points, by emrah, 24 days ago

39 comments

simonw, 24 days ago
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

    llm install llm-mlx
    llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
    llm -m mlx-community/gemma-3-27b-it-qat-4bit \
      -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
      -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
      -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52543#response - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
Samin100, 24 days ago
I have a few private “vibe check” questions and the 4-bit QAT 27B model got them all correct. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this: Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
diggan, 24 days ago
The first graph compares "Elo score" across various models at "native" BF16 precision, and the second compares VRAM usage between native BF16 and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph missing - the one comparing quality between BF16 and QAT? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
mark_l_watson, 24 days ago
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32GB-memory Mac.

gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful - not only for Python dev, but for Haskell and Common Lisp also.

I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it simply makes me feel good to have everything open source/weights running on my own system.

When I bought my 32GB Mac a year ago, I didn't expect to be so happy running gemma3:27b-it-qat with open-codex locally.
mekpro, 24 days ago
Gemma 3 is way, way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its size: the model is too large (even though it can run fast with MoE), which limits the applicable users to the small percentage of enthusiasts with enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
trebligdivad, 24 days ago
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very impressive at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
manjunaths, 24 days ago
I am running this on a 16 GB AMD Radeon 7900 GRE in a 64 GB machine with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.

It runs at around 26 tokens/sec and FP16; FP8 is not supported by the Radeon 7900 GRE.

I just love it.

For coding, QwQ 32B is still king. But with a 16 GB VRAM card it gives me ~3 tokens/sec, which is unusable.

I tried to make Gemma 3 write a PowerShell script with a Terminal GUI interface and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.

But for most general purposes it is great. My kid's been using it to feed in his school textbooks and ask it questions. It is better than anything else currently.

Somehow it is more "uptight" than Llama or the Chinese models like Qwen. Can't put my finger on it; the Chinese models seem nicer and more talkative.
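As a rough sketch of that kind of setup (not the commenter's exact configuration - the GGUF filename, port, and layer count below are assumptions), llama.cpp's llama-server can expose a local model over the home network:

    # Serve a local GGUF to other machines on the LAN via llama-server.
    # Model path, port, and -ngl value are illustrative assumptions.
    llama-server \
      -m ./gemma-3-27b-it-qat-q4_0.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      -ngl 99 \
      -c 8192
    # Point Open WebUI (or any OpenAI-compatible client) at http://<internal-ip>:8080/v1

The server speaks the OpenAI-compatible chat API, which is what lets a web UI on another machine in the house talk to it.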
behnamoh, 24 days ago
This is what local LLMs need - being treated like first-class citizens by the companies that make them.

That said, the first graph is misleading about the number of H100s required to run DeepSeek R1 at FP16. The model is FP8.
mythz, 24 days ago
The speed gains are real: after downloading the latest QAT gemma3:27b, eval perf is now 1.47x faster on Ollama, up from 13.72 to 20.11 tok/s (on A4000s).
porphyra, 24 days ago
It is funny that Microsoft has been peddling "AI PCs" and Apple has been peddling "made for Apple Intelligence" for a while now, when in fact usable models for consumer GPUs are only barely starting to be a thing, and only on extremely high-end GPUs like the 3090.
emrah, 24 days ago
Available on Ollama: https://ollama.com/library/gemma3
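For anyone following along, pulling and running the QAT build through Ollama looks roughly like this; the tag matches the one mentioned elsewhere in the thread, but check the library page above for the current naming:

    # Pull the 27B QAT build and run it with a one-off prompt.
    ollama pull gemma3:27b-it-qat
    ollama run gemma3:27b-it-qat "Summarize quantization-aware training in two sentences."

Running `ollama run` without a prompt drops into an interactive chat session instead.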
technologesus, 24 days ago
Just for fun I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately, no matter how hard I prompted, gemma-3-27b QAT is not really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" when it is looking out at the horizon in the desert.

Here is the JSON schema: https://pastebin.com/SiEJ6LEz
System prompt: https://pastebin.com/R68QkfQu
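The real schema and system prompt are in the pastebin links above. Purely as an illustration of the approach (the schema fields and model name below are hypothetical, not the ones from the pastebin), a structured-output request against LM Studio's local OpenAI-compatible server might look roughly like this:

    # Hypothetical sketch: ask for one game action as strict JSON from
    # LM Studio's local server (default port 1234).
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gemma-3-27b-it-qat",
        "messages": [
          {"role": "user", "content": "You see a stone block directly ahead. What do you do?"}
        ],
        "response_format": {
          "type": "json_schema",
          "json_schema": {
            "name": "minecraft_action",
            "strict": true,
            "schema": {
              "type": "object",
              "properties": {
                "action": {"type": "string", "enum": ["move_forward", "turn_left", "turn_right", "mine", "jump"]},
                "reason": {"type": "string"}
              },
              "required": ["action", "reason"]
            }
          }
        }
      }'

Constraining the output to an enum of actions is what makes the responses machine-usable as game controls, even when the model's scene understanding is shaky.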
miki123211, 24 days ago
What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.

We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.

I would normally say vLLM, but the blog post notably does not mention vLLM support.
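For reference, a multi-user OpenAI-compatible deployment with vLLM would be roughly the sketch below. This is an assumption-laden illustration: the post does not confirm vLLM support for the QAT checkpoints, and the model id is a placeholder to be swapped for the right entry from the Hugging Face collection.

    # Hypothetical: serve an OpenAI-compatible endpoint for many concurrent users.
    # Replace the model id with the appropriate QAT checkpoint if/when supported.
    vllm serve google/gemma-3-27b-it \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.90

Structured output would then usually go through vLLM's guided decoding (e.g. guided_json on requests), though support varies by model and version.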
wtcactus, 24 days ago
They keep mentioning the RTX 3090 (with 24 GB VRAM), but the model is only 14.1 GB.<p>Shouldn’t it fit a 5060 Ti 16GB, for instance?
casey2, 23 days ago
I don't get the appeal. For LLMs to be useful at all you need to be at least in the dozen-exabit range per token, and zettabit/s if you want something usable.

There is really no technological path towards supercomputers that fast on a human timescale, nor in 100 years.

The thing that makes LLMs useful is their ability to translate concepts from one domain to another. Overfitting on choice benchmarks, even a spread of them, will lower their usefulness in every general task by destroying information that is encoded in the weights.

Ask Gemma to write a five-paragraph essay on any niche topic and you will get plenty of statements that have an extremely small likelihood of appearing in relation to that topic, but a high likelihood of appearing in related, more popular topics. ChatGPT less so, but still at least one per paragraph. I'm not talking about factual errors or common oversimplifications; I'm talking about completely unrelated statements. What you're asking about is largely outside its training data - of which a 27 GB model gives you what, a few hundred gigs? Seems like a lot, but you have to remember that there is a lot of stuff you probably don't care about that many people do. Stainless steel and Kubernetes are going to be well represented; your favorite media, probably not; anything relatively current, definitely not. Which sounds fine, until you realize that the people who care about stainless steel and Kubernetes likely care about some much more specific aspect that isn't going to be represented, and you are back to the same problem of low usability.

This is why I believe that scale is king and that both data and compute are the big walls. Google has YouTube data, but they are only using it in Gemini.
umajho, 24 days ago
I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in an image; I didn't expect it to be able to recognize text within the image.)

Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and the unoptimized Q4 version I currently use (such as benchmark scores).

(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
Havoc, 24 days ago
Definitely my current fav. Also interesting that for many questions the response is very similar to the Gemini series. Must be sharing training datasets pretty directly.
999900000999, 24 days ago
Assuming this can match Claude's latest under full-time usage (as in you have a system that's constantly running code without any user input), you'd probably save 600 to 700 a month. A 4090 is only 2K, so you'd see an ROI within 90 days.

I can imagine this will serve to drive prices for hosted LLMs lower.

At this level, any company that produces even a nominal amount of code should be running LLMs on prem (or on AWS if you're in the cloud).
piyh, 24 days ago
Meta Maverick is crying in the shower getting so handily beat by a model with 15x fewer params
jarbus, 24 days ago
Very excited to see these kinds of techniques. I think getting a 30B-level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.
Alifatisk, 24 days ago
Apart from being lighter than the other models, is there anything else the Gemma model is specifically good at, or does better than the other models?
holografix, 24 days ago
Could 16 GB of VRAM be enough for the 27B QAT version?
api, 24 days ago
When I see 32B or 70B models performing similarly to 200+B models, I don’t know what to make of this. Either the latter contains more breadth of information but we have managed to distill latent capabilities to be similar, the larger models are just less efficient, or the tests are not very good.
yuweiloopy2, 23 days ago
Been using the 27B QAT model for batch processing 50K+ internal documents. The 128K context is game-changing for our legal review pipeline. Though I wish the token generation was faster - at 20 tps it's still too slow for interactive use compared to Claude Opus.
ece, 23 days ago
On Hugging Face: https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b
briandear, 24 days ago
The normal Gemma models seem to work fine on Apple silicon with Metal. Am I missing something?
justanotheratom, 24 days ago
Anyone packaged one of these in an iPhone app? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
punnerud, 24 days ago
Just tested the 27B, and it's not very good at following instructions and is very limited on more complex code problems.

Mapping from one JSON with a lot of plain text into a new structure fails every time.

Ask it to generate SVG, and it's very simple and almost too dumb.

Nice that it doesn't need a huge amount of RAM, and it performs OK on smaller languages in my initial tests.
CyberShadow, 24 days ago
How does it compare to CodeGemma for programming tasks?
gigel82, 24 days ago
FWIW, the 27B Q4_K_M takes about 23 GB of VRAM with 4K context and 29 GB with 16K context, and runs at ~61 t/s on my 5090.
perching_aix, 24 days ago
This is my first time trying to locally host a model - I gave both the 12B and 27B QAT models a shot.

I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12-gig card available and the 12B model ran very nice and swift.

However, they're seemingly terrible at actually assisting with stuff. I tried something very basic: asked for a PowerShell one-liner to get the native block size of my disks. It ended up hallucinating fields, then telling me to go off into the deep end - first elevating to admin, then using WMI, then bringing up IOCTLs. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
btbuildem, 24 days ago
Is 27B the largest QAT Gemma 3? Given these size reductions, it would be amazing to have the 70B!
gitroom, 23 days ago
Nice, loving the push with local models lately. Always makes me wonder, though: do you think privacy wins out over speed and convenience in the long run, or do people just stick with what's quickest?
noodletheworld, 24 days ago
?

Am I missing something?

These have been out for a while; if you follow the HF link you can see, for example, that the 27B quant has been downloaded from HF 64,000 times over the last 10 days.

Is there something more to this, or is it just a follow-up blog post?

(Is it just that Ollama finally has partial support - no images, right? Or something else?)
XCSme, 24 days ago
So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?
mattfrommars, 24 days ago
Anyone had success using Gemma 3 QAT models on Ollama with Cline? They just don't work as well compared to Gemini 2.0 Flash provided by the API.
anshumankmr, 24 days ago
my trusty RTX 3060 is gonna have its day in the sun... though I have run a bunch of 7B models fairly easily on Ollama.
cheriot24 天前
Is there already a Helium for GPUs?
rob_c, 24 days ago
Given how long it took between this being released and this community picking up on it... lol