TFA seems to miss a lot of things.

Apple's unified memory makes Macs price-compelling versus x86 plus discrete GPU(s) for large models, say anything over 24-32GB. But a 32GB Mac doesn't really take advantage of that architecture.

(IIRC, by default Metal can use about 66% of RAM on < 32GB boxes, and something higher on > 32GB ones, though you can override that value via sysctl -- a rough sketch of doing so is below.)

Macs can also run MLX in addition to GGUF, which on these smaller models would be faster. No mention of MLX, or indeed GGUF (minimal MLX example below).

The only model tested seems to be a distil of DeepSeek R1 onto Qwen, which I'd have classified as 'good, but not great'.

The author bemoans that all quants of it are > 5GB, which isn't true -- though with ~20GB of effective VRAM to play with here, you wouldn't *want* to be using the Q4 (at ~4.6GB).

The author also seems to conflate the one-off download cost (time) from Hugging Face with the ongoing performance cost of *using* the tool.

No actual client-side tooling appears to be in play either, which seems odd given the claim that local inference is 'not ready as a developer platform'.

The usual starting point for most devs using local LLMs is VS Code + continue.dev, where the 'developer experience' is rather more interesting than copy-pasting to and from a terminal (an indicative config sketch is below).

The sole criterion for model capability appears to be text-to-SQL, which would be fair enough if the piece were about the applicability of "Local LLM Inference for Text to SQL". I'd have expected the more coding-specific models (Qwen2.5 Coder 14B, Codestral, Gemma?) to be more interesting than a single ~6GB distil of R1 and Qwen.

Hugging Face has some functional search, though https://llm.extractum.io/list/ is a bit better in my experience, as you can filter and sort by size, vintage, licence, max context length, popularity, etc.

I concur that freely available can-run-in-16GB-of-RAM models are not as good as Claude, but I disagree that the user experience is as bad as painted here.
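On the sysctl override: a minimal sketch, assuming Apple Silicon on macOS Sonoma or later, where the `iogpu.wired_limit_mb` key is exposed (earlier releases used `debug.iogpu.wired_limit`; worth verifying the key on your own machine before relying on it):

```python
import subprocess

def get_wired_limit_mb() -> int:
    """Read the current GPU wired-memory limit (0 means 'use the OS default')."""
    out = subprocess.run(
        ["sysctl", "-n", "iogpu.wired_limit_mb"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def set_wired_limit_mb(limit_mb: int) -> None:
    """Raise the limit. Needs root, and resets on reboot."""
    subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={limit_mb}"], check=True)

if __name__ == "__main__":
    print("current limit (MB):", get_wired_limit_mb())
    # e.g. let Metal use ~28GB of a 32GB machine (commented out: needs sudo):
    # set_wired_limit_mb(28672)
```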
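On the MLX point: a minimal sketch using the mlx-lm package (`pip install mlx-lm`, Apple Silicon only). The model repo name is illustrative -- substitute any MLX-converted model from the mlx-community org on Hugging Face:

```python
from mlx_lm import load, generate

# Illustrative repo name; pick any MLX-converted quant that fits your RAM budget.
model, tokenizer = load("mlx-community/Qwen2.5-Coder-14B-Instruct-4bit")

prompt = "Write a SQL query returning the ten most recent orders per customer."
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```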
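And on the continue.dev route: an indicative sketch of the older `~/.continue/config.json` pointing at local Ollama models, written as Python for consistency with the above. The schema changes between Continue releases, so treat the keys as an assumption and check the current docs; the model tags assume you've already pulled them with Ollama:

```python
import json
import pathlib

# Indicative only: continue.dev's config schema evolves between releases.
config = {
    "models": [
        {"title": "Qwen2.5 Coder 14B (local)", "provider": "ollama",
         "model": "qwen2.5-coder:14b"},
    ],
    "tabAutocompleteModel": {
        "title": "Qwen2.5 Coder 1.5B (local)", "provider": "ollama",
        "model": "qwen2.5-coder:1.5b",
    },
}

path = pathlib.Path.home() / ".continue" / "config.json"
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(config, indent=2))
print(f"wrote {path}")
```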