I have tried a lot of local models. I have 656GB of them on my computer, so I have experience with a diverse array of LLMs. Gemma has been nothing to write home about and has been disappointing every single time I have used it.

Models that are worth writing home about:

EXAONE-3.5-7.8B-Instruct - Excellent at taking podcast transcriptions and generating show notes and summaries.

Rocinante-12B-v2i - Fun for stories and D&D

Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks

OpenThinker-7B - Good and fast reasoning

The DeepSeek distills - Able to handle more complex tasks while still being fast

DeepHermes-3-Llama-3-8B - A really good LLM

Medical-Llama3-v2 - Very interesting, but be careful

Plus more, but not Gemma.
I wrote a mini guide on running Gemma 3 at https://docs.unsloth.ai/basics/tutorial-how-to-run-gemma-3-effectively

The recommended settings according to the Gemma team are:

temperature = 0.95

top_p = 0.95

top_k = 64

Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via:

    ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M
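If you're scripting this rather than using the CLI, here's a minimal sketch of hitting a locally running Ollama server with those sampling settings over its HTTP API (assumes Ollama is on its default port and the GGUF has already been pulled; the prompt is just a placeholder):

    import requests

    # Query a local Ollama server with the Gemma team's recommended sampling settings.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M",
            "prompt": "Summarize the Gemma 3 technical report in three bullet points.",
            "stream": False,
            "options": {
                "temperature": 0.95,
                "top_p": 0.95,
                "top_k": 64,
            },
        },
    )
    print(resp.json()["response"])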
See the other HN submission (for the Gemma 3 technical report doc) for a more active discussion thread - 50 comments at time of writing this.

https://news.ycombinator.com/item?id=43340491
Small models should be trained on specific problems in specific languages, and should be built one upon another, the way containers work. I see a future where a factory or home has a local AI server hosting many highly specific models, continuously trained by a super-large LLM on the web and connected via the network to all instruments and computers to basically control the whole factory. I also see a future where all machinery comes with an AI-readable language for its own functioning: an HTTP-like protocol for two-way communication between a machine and an AI. Lots of possibilities.
After reading the technical report, take the time to download the model and run it against a few prompts. In five minutes you will understand how broken LLM benchmarking is.
No mention of how well it's claimed to perform with tool calling?

The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems - so it will be interesting to see how 3 does in this regard.
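If you want to poke at this yourself, a rough sketch for probing tool calling through Ollama's chat API is below. The get_weather tool is a made-up example schema; whether the model actually returns a tool_calls entry is exactly what you'd be testing.

    import requests

    # Hypothetical single-tool schema to see whether the model emits a tool call.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gemma3:27b",
            "messages": [{"role": "user", "content": "What's the weather in Oslo right now?"}],
            "tools": tools,
            "stream": False,
        },
    )
    # A model that handles tool calling should return a message with a tool_calls list.
    print(resp.json()["message"].get("tool_calls"))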
Not sure if anyone else experiences this, but Ollama downloads start off strong and then the last few MBs take forever.

Finally just finished downloading (gemma3:27b). It requires the latest version of Ollama, but it's working now, getting about 21 tok/s on my local 2x A4000.

From my few test prompts it looks like a quality model; going to run more tests against mistral-small:24b to see if it's going to become my new local model.
The claim of “strongest” (what does that even mean?) seems moot. I don’t think a multimodal model is the way to go for use on single, home GPUs.

I would much rather have specific tailored models to use in different scenarios, which could be loaded into the GPU when needed. It’s a waste of parameters to have half of the VRAM loaded with the parts of the model targeting image generation when all I want to do is write code.
How does it compare to OlympicCoder 7B [0], which allegedly beats Claude Sonnet 3.7 in the International Olympiad in Informatics [1]?

[0] https://huggingface.co/open-r1/OlympicCoder-7B?local-app=vllm

[1] https://pbs.twimg.com/media/GlyjSTtXYAAR188?format=jpg&name=4096x4096
My usual non-scientific benchmark is asking the model to implement the game Tetris in Python, and then iterating with the LLM to fix/tweak it.

My prompt to Gemma 27b (q4) on Open WebUI + Ollama: "Can you create the game tetris in python?"

It immediately starts writing code. After the code is finished, I noticed something very strange: it starts a paragraph like this:

"Key improvements and explanations:

    Clearer Code Structure: The code is now organized into a Tetris class, making it much more maintainable and readable. This is essential for any non-trivial game.
"

Followed by a bunch of fixes/improvements, as if this was not the first iteration of the script.

I also notice a very obvious error: in the `if __name__ == '__main__':` block it tries to instantiate a `Tetris` class, when the name of the class it created was "TetrisGame".

Nevertheless, I try to run it and paste the `NameError: name 'Tetris' is not defined` error along with the stack trace specifying the line. Gemma then gives me this response:

"The error message "NameError: name 'Tetris' is not defined" means that the Python interpreter cannot find a class or function named Tetris. This usually happens when:"

Then it continues with a generic explanation of how to fix this error in arbitrary programs. It seems like it completely ignored the code it just wrote.
These bar charts are getting more disingenuous every day. This one makes it seem like Gemma3 ranks as nr. 2 on the arena just behind the full DeepSeek R1. But they just cut out everything that ranks higher. In reality, R1 currently ranks as nr. 6 in terms of Elo. It's still impressive for such a small model to compete with much bigger models, but at this point you can't trust any publication by anyone who has any skin in model development.
Discrete GPUs are finished for AI.

They've had years to provide the needed memory but can't/won't.

The future of local LLMs is APUs such as the Apple M series and AMD Strix Halo.

Within 12 months everyone will have relegated discrete GPUs to the AI dustbin and will be running 128GB to 512GB of delicious local RAM - vastly more than any discrete GPU could dream of.
PSA: DO NOT USE OLLAMA FOR TESTING.

Ollama silently (!!!) drops messages if the context window is exceeded (instead of, you know, just erroring - who in the world made this decision?).

The workaround until now was to (not use Ollama, or) make sure to only send a single message. But now they seem to silently truncate single messages as well, instead of erroring! (This explains the sibling comment where a user could not reproduce the results locally.)

Use LM Studio, llama.cpp, OpenRouter, or anything else, but stay away from Ollama!
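For what it's worth, if you are stuck with Ollama, you can at least raise the limit explicitly instead of relying on its small default context window - a sketch via the HTTP API (this doesn't change the silent-truncation behaviour described above, it only moves the point at which it kicks in):

    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gemma3:27b",
            "messages": [{"role": "user", "content": "long prompt goes here"}],
            "stream": False,
            # num_ctx raises the context window from Ollama's default.
            "options": {"num_ctx": 32768},
        },
    )
    print(resp.json()["message"]["content"])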