I'm wondering how small a model can be and still be "generally intelligent" (as in LLM-intelligent, not AGI). There must be some size that's just too small to hold "all the information" in.

And I also wonder at what point we'll see specialized small models. If I want help coding, it's probably fine if the model doesn't know who directed "Jaws". I suspect that is the future: many small, specialized models.

But maybe training compute will just get to the point where we can run a full-featured model on our desktop (or phone)?
Maybe Llama 3.3 70B doesn't count as running on "one GPU", but it certainly runs just fine on one Mac, and in my tests it's far better at holding onto concepts over a long conversation than Gemma 3, which starts getting confused after about 4,000 tokens.
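A minimal sketch of the kind of long-conversation check that surfaces this, assuming a local Ollama server on its default port and that the llama3.3 / gemma3 tags have already been pulled (model names and filler content are just placeholders):

```python
# Plant a detail early, pad the chat with filler, then ask the model to recall it.
# Assumes a local Ollama server on port 11434, e.g. after `ollama pull llama3.3`.
import requests

MODEL = "llama3.3"  # swap in "gemma3" to compare

messages = [
    {"role": "user", "content": "Remember this: the project codename is BLUE HERON."},
    {"role": "assistant", "content": "Got it, the codename is BLUE HERON."},
]

# Pad the conversation well past a few thousand tokens of unrelated chatter.
for i in range(40):
    messages.append({"role": "user", "content": f"Unrelated question #{i}: explain photosynthesis in one paragraph."})
    messages.append({"role": "assistant", "content": "Photosynthesis converts light into chemical energy... " * 5})

messages.append({"role": "user", "content": "What was the project codename I gave you at the start?"})

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": MODEL, "messages": messages, "stream": False},
    timeout=600,
)
# A model that "holds onto concepts" should still answer BLUE HERON here.
print(resp.json()["message"]["content"])
```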
Gemma 3 is a lot better at writing than Gemma 2, for sure, but the big improvement is that I can actually use a 32k+ context window without it flipping out into random garbage.
Technically, the 1.58-bit Unsloth quant of DeepSeek R1 runs on a single GPU+128GB of system RAM. It performs amazingly well, but you'd better not be in a hurry.
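For what that setup looks like in practice, here's a rough sketch using llama-cpp-python with partial GPU offload; the filename and layer count below are placeholders, not the actual shard name or the right split for your VRAM:

```python
# Sketch: offload a few dozen layers to the GPU and let the rest spill into
# system RAM. Tune n_gpu_layers to whatever fits in your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-quant-00001-of-00003.gguf",  # hypothetical filename for the first GGUF shard
    n_gpu_layers=24,   # layers kept on the GPU; the remainder runs on the CPU from system RAM
    n_ctx=8192,        # context window; larger costs more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about waiting for tokens."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```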
I found Mistral Small 3.1, which was released slightly after Gemma 3, much better.

Far fewer refusals, more accurate, less babbling, generally better overall, but especially at coding.
Google post from last week: https://blog.google/technology/developers/gemma-3/
My instinct is that it would be cheaper overall to buy API credits when needed, compared with buying a top-of-the-line GPU which sits idle for most of the day. That also opens up access to larger models.
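A quick back-of-envelope makes the point; all the prices and usage numbers below are made-up assumptions, not quotes, so plug in your own:

```python
# Back-of-envelope comparison of an amortized GPU vs. pay-per-token API usage.
gpu_price = 2000.0          # assumed up-front cost of a high-end consumer GPU, USD
gpu_lifetime_years = 3      # assumed useful life before you want an upgrade
api_cost_per_mtok = 3.0     # assumed blended API price, USD per million tokens
tokens_per_day = 200_000    # assumed daily usage (a heavy day of coding assistance)

gpu_cost_per_year = gpu_price / gpu_lifetime_years
api_cost_per_year = tokens_per_day * 365 * api_cost_per_mtok / 1_000_000

print(f"GPU amortized:      ~${gpu_cost_per_year:.0f}/yr")   # ~$667/yr
print(f"API at this usage:  ~${api_cost_per_year:.0f}/yr")   # ~$219/yr
# With these assumptions the API wins, and the break-even only shifts
# if your daily token volume is several times higher.
```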
Does anyone actually use Google's AI products? For an AI company with an AI CEO using AI language translation, I think their actual LLM products are all terrible and have a terrible reputation. And who wants their private conversations shipped back to Google for spying?