Local LLM inference – impressive but too hard to work with

84 points | by aazo11 | 24 days ago

16 comments

thot_experiment | 24 days ago
Yikes, what's the bar for "dead simple" these days? Even my totally non-technical gamer friends are messing around with ollama because I just have to give them one command to get any of the popular LLMs up and running.

Now of course "non-technical" here is still a PC gamer who's had to fix drivers once or twice and messaged me to ask "hey how do i into LLM, Mr. AI knower", but I don't think twice these days about showing any PC owner how to use ollama because I know I probably won't be on the hook for much technical support. My sysadmin friends are easily writing clever scripts against ollama's JSON output to do log analysis and other stuff.
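A minimal sketch of the kind of script described here, assuming ollama is running on its default port (11434) and a model such as llama3 has already been pulled; the prompt and log lines are placeholders:

```python
import json
import urllib.request

def summarize_logs(text: str, model: str = "llama3") -> str:
    # Single non-streaming request to ollama's local JSON API.
    payload = json.dumps({
        "model": model,
        "prompt": f"Summarize the following log lines:\n{text}",
        "stream": False,  # return one JSON object instead of a token stream
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(summarize_logs("ERROR disk full on /dev/sda1\nWARN retrying mount"))
```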
Jedd | 24 days ago
TFA seems to miss a lot of things.

Macs' unified memory makes them (price-)compelling over x86 with GPU(s) for large models, say anything over 24-32 GB. But a 32 GB Mac doesn't take advantage of that architecture.

(IIRC by default you can use 66% of RAM in < 32 GB Metal boxes, and something higher in > 32 GB, though you can override that value via sysctl.)

Macs can also run mlx in addition to gguf, which on these smaller models would be faster. No mention of mlx, or indeed gguf.

The only model tested seems to be a distil of DeepSeek R1 with Qwen, which I'd have classified as "good, but not great".

The author bemoans that all quants of it are > 5 GB, which isn't true. Though with 20 GB of effective VRAM to play with here, you wouldn't *want* to be using the Q4 (at 4.6 GB).

The author also seems to conflate the one-off download cost (time) from hf with the ongoing performance cost of *using* the tool.

No actual client-side tooling in play, either, by the looks of it, which seems odd given the claim that local inference is "not ready as a developer platform". The usual starting point for most devs using local LLMs is vscode + continue.dev, where the "developer experience" is a bit more interesting than just copy-pasting to a terminal.

The criterion (singular) for LLM model expertise appears to be "text to SQL", which is fair enough if you were writing about the applicability of "Local LLM Inference For Text to SQL". I'd have expected the more coding-specific models (qwen2.5 coder 14B, codestral, gemma?) to be more interesting than just one 6 GB distil of R1 & Qwen.

Huggingface has some functional search, though https://llm.extractum.io/list/ is a bit better in my experience, as you can tune and sort by size, vintage, licence, max context length, popularity, etc.

I concur that freely available can-run-in-16GB-of-RAM models are not as good as Claude, but disagree that the user experience is as bad as painted here.
antirez | 24 days ago
Download the model in the background. Serve the client with an LLM vendor API just for the first requests, or even with that same local LLM installed on your own servers (likely cheaper). By doing so, in the long run the inference cost is near zero, and it lets you use LLMs in otherwise impossible business models (like freemium).
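A rough sketch of that pattern, assuming an OpenAI-compatible local server (e.g. ollama's /v1 endpoint) and a hosted vendor API for the interim; the model names and the download step are placeholders:

```python
import threading
from openai import OpenAI

LOCAL = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # e.g. ollama
CLOUD = OpenAI()  # reads OPENAI_API_KEY from the environment

local_ready = threading.Event()

def download_weights() -> None:
    # Placeholder: pull the model here (ollama pull / huggingface_hub download),
    # then flip the switch so new requests go local.
    local_ready.set()

def complete(prompt: str) -> str:
    # First requests go to the vendor API; once the weights are in place, go local.
    client, model = (LOCAL, "llama3") if local_ready.is_set() else (CLOUD, "gpt-4o-mini")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

threading.Thread(target=download_weights, daemon=True).start()
```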
ijk | 24 days ago
There are two general categories of local inference:

- You're running a personal hosted instance. Good for experimentation and personal use, though there's a tradeoff against renting a cloud server.

- You want to run LLM inference on client machines (i.e., you aren't directly supervising it while it is running).

I'd say the article is mostly talking about the second one. Doing the first will get you familiar enough with the ecosystem to handle some of the issues he ran into when attempting the second (e.g., exactly which model to use). But the second has a bunch of unique constraints: you want things to just work for your users, after all.

I've done in-browser neural network stuff in the past (back when using TensorFlow.js was a reasonable default choice), and based on the way LLM trends are going I'd guess that edge-device LLM will be relatively reasonable soon; I'm not quite sure I'd deploy it in production this month, but ask me again in a few.

Relatively tightly constrained applications are going to benefit more than general-purpose chatbots: pick a small model that's relatively good at your task and train it on enough of your data, and you can get a 1B or 3B model with acceptable performance, let alone the 7B ones being discussed here. It absolutely won't replace ChatGPT (though we're getting closer to replacing ChatGPT 3.5 with small models). But if you've got a specific use case that will hold still long enough to deploy a model, it can definitely give you an edge versus relying on the APIs.

I expect games to be one of the first to try this: per-player-action API costs murder per-user revenue, most gaming devices have some form of GPU already, and most games are shipped as apps, so bundling a few more GB in there is, if not reasonable, at least not unprecedented.
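The "tightly constrained application" case above could look roughly like this sketch, assuming llama-cpp-python and a small instruct-tuned GGUF bundled with the app; the model path and prompts are illustrative only:

```python
from llama_cpp import Llama

# Load a small quantized model shipped alongside the application.
llm = Llama(model_path="assets/small-model-q4.gguf", n_ctx=2048, verbose=False)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You write one-line item descriptions for a game."},
        {"role": "user", "content": "rusty iron sword"},
    ],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```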
K0balt | 24 days ago
The only bar to using local is having the hardware and downloading the model. I find it nominally easier to use than the OpenAI API, since the local API isn't picky about some of the fields (by default). Agentic flows can use local 90 percent of the time and reach out to god when they need divine insight, saving 90 percent of token budgets and somewhat reducing external exposure, though I prefer to keep everything local if possible. It's not hard to run a 70B model locally, but the queue can get backed up with multiple users unless you have very strong hardware. Still, you can shift overflow to the cloud if you want.
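A toy sketch of that tiered pattern, assuming an OpenAI-compatible local server (llama.cpp server, ollama, etc.) with a hosted API as the fallback; the escalation heuristic and model names are invented for illustration:

```python
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # local OpenAI-compatible server
cloud = OpenAI()  # hosted fallback, used only when the local answer punts

def ask(prompt: str) -> str:
    draft = local.chat.completions.create(
        model="local-70b",  # whatever model the local server is serving
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Crude escalation signal; a real agent would use a scorer or tool feedback.
    if "i don't know" in draft.lower() or len(draft) < 20:
        return cloud.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
    return draft
```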
zellyn | 24 days ago
Weird to give MacBook Pro specs and omit RAM. Or did I miss it somehow? That's one of the most important factors.
bionhoward | 24 days ago
LM Studio seems pretty good at making local models easier to use
ranger_danger | 24 days ago
I thought llamafile was supposed to be the solution to "too hard to work with"?

https://github.com/Mozilla-Ocho/llamafile
larodi | 24 days ago
Having done my masters on the topic of grammar-assisted text2sql, let me add some additional context here:

- First of all, local inference can never beat cloud inference, for the very simple reason that costs go down with batching. It took me two years to actually understand what batching is: the LLM tensors flowing through the transformer layers have a dimension designed specifically for processing data in parallel, so whether you process 1 sequence or 128 sequences, the cost is the same. I've seen very few articles state this clearly, so bear in mind: this is the primary blocker for local inference competing with cloud inference.

- Second, and this is not a light one to take: LLM-assisted text2sql is not trivial, not at all. You may think it is, you may expect cutting-edge models to do it right, but there are plenty of reasons models fail so badly at this seemingly trivial task. You may start with an arbitrary article such as https://arxiv.org/pdf/2408.14717 and dig through the references; sooner or later you will stumble on one of dozens of overview papers, mostly by Chinese researchers (such as https://arxiv.org/abs/2407.10956), where the approaches are summarized. Caution: you may feel inspired that AI will not take over your job, or you may feel miserable about how much effort is spent on this task and how badly everything fails in real-world scenarios.

- Finally, something we agreed on with a professor advising a doctorate candidate whose thesis, surprisingly, was on the same topic: LLMs lean much better on GraphQL and other structured formats such as JSON than on the complex grammar of SQL, which is not a regular grammar but a context-free one, and so takes more complex machinery to parse and very often recursion.

- Which brings us to the most important question: why do commercial GPTs fare so much better at this than local models? Well, it is presumed the top players not only use MoEs but also employ beam search, perhaps speculative inference, and all sorts of optimizations at the hardware level. While all of this is not beyond comprehension for a casual researcher at a casual university (like myself), you don't get to easily run it all locally. I have not written an inference engine myself, but I imagine MoE plus beam search is super complex, as beam search basically means you fork the whole LLM execution state and go back and forth. Not sure how this even works together with batching.

So basically, this is too expensive. Besides, at the moment (to my knowledge) only vllm (the engine) has a reasonably working local beam search. I would've loved to see llama.cpp's beam search get a rewrite, but it stalled. Trying to get beam search working with the current Python libs is nearly impossible on commodity hardware, even if you have 48 gigs of RAM, which already means a very powerful GPU.
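A shape-level sketch of the batching point above: one forward pass streams the same weights whether the batch holds 1 or 128 sequences, so the fixed cost of reading the weights is shared across the whole batch. A single linear layer stands in for a transformer block here; this is an illustration, not a real inference engine:

```python
import torch

hidden = 4096
block = torch.nn.Linear(hidden, hidden)   # weights: hidden x hidden, read once per forward call

single = torch.randn(1, 1, hidden)        # (batch=1,   seq=1, hidden) -- one user's decode step
batched = torch.randn(128, 1, hidden)     # (batch=128, seq=1, hidden) -- 128 users per step

with torch.no_grad():
    y1 = block(single)
    y128 = block(batched)                 # same weights, ~same weight traffic, 128x the output

print(y1.shape, y128.shape)               # torch.Size([1, 1, 4096]) torch.Size([128, 1, 4096])
```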
pentagrama | 24 days ago
A local LLM would be a great idea for Mozilla to try in its Orbit [1] extension to summarize articles. But sadly, they are only going with the cloud option (for now).

[1] https://orbitbymozilla.com/
mlcq | 24 days ago
I installed Ollama on my 64GB M1 Max and ran gemma3:27b. Well, it works, but it's a bit laggy. I use LLMs quite frequently, and compared to running them locally I still prefer using the API; it's more efficient and accurate.
Havoc | 24 days ago
I don't think the cost point is correct. Last I saw the calculations, the API was cheaper than the juice needed to run a local GPU, never mind the equipment.

Plus there are a mountain of free tokens out there, like Gemini's free tier.
segmondy | 24 days ago
The advantage of local LLMs is that you can literally find many models that have no cloud equivalent; someone may have made a fine-tune that meets your needs. If you can't find a generic model that does, you can pick an appropriately sized model you can run, build or get a dataset, train in the cloud, then use the model locally.
anarticle | 23 days ago
Seems fine to me. I use it like a local Google for software-engineering questions, or rig it into aider to write tests while I'm doing something else. Keeps me focused and out of my web browser, frankly.

EDIT: oh! It's also fantastic if you're on a plane!
aazo11 | 24 days ago
I spent a couple of weeks trying out local inference solutions for a project and wrote up my thoughts, with some performance benchmarks, in a blog post.

TLDR: What these frameworks can do on off-the-shelf laptops is astounding. However, it is very difficult to find and deploy a task-specific model, and the models themselves (even with quantization) are so large that the download would kill the UX for most applications.
Der_Einzige | 24 days ago
Why is HN so full of people who don't know good LLM tooling?

SillyTavern and vllm are right there, ready to give you a class-leading experience, but you all ignore them and use stuff like LM Studio (missing tons of features that SillyTavern or even oobabooga have, like advanced samplers such as min_p or top-nsigma), or worse, you use even slower solutions like ollama or llama.cpp.

The real reason folks don't like to run models on their own is that the tools have so far been built by obvious coomers (we all know what most people use SillyTavern or comfyUI for). Just embrace the vibe set by these products instead of resisting it by forcing yourself to use shit tools.

This is yet ANOTHER post I have to make about this: https://news.ycombinator.com/item?id=43743337#43743658

I don't care how many downvotes I get for pointing this out yet again. I'm at ICLR about to present an Oral, and the vast majority of the people who'd downvote me for calling out poor tooling choices haven't done anything of note in AI before...