
Ask HN: What is the current (Apr. 2024) gold standard of running an LLM locally?

195 points by js98 · about 1 year ago
There are many options and many opinions out there. What is currently the recommended approach for running an LLM locally (e.g., on my 3090 with 24 GB)? Are the options ‘idiot proof’ yet?

26 comments

CuriouslyC · about 1 year ago
Given you're running a 3090 24GB, go with Oobabooga/SillyTavern. And don't come here for advice on this stuff; go to https://www.reddit.com/r/LocalLLaMA/. They usually have a "best current local model" thread pinned, and you're less likely to get incorrect advice.
aantix · about 1 year ago
Ollama is really easy:

    brew install ollama
    brew services start ollama
    ollama pull mistral

You can query Ollama via HTTP. It provides a consistent interface for prompting, regardless of model.

https://github.com/ollama/ollama/blob/main/docs/api.md#request
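As a concrete illustration of the HTTP interface mentioned above, here is a minimal Python sketch, assuming Ollama's default port (11434) and the mistral model pulled in the commands; field names follow the linked API docs:

    import json
    import urllib.request

    # Single (non-streaming) completion request against a local Ollama server.
    payload = {
        "model": "mistral",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])

Swapping the model name is the only change needed to prompt a different model through the same endpoint.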
keriati1 · about 1 year ago
We run coding assistance models locally on MacBook Pros, so here is my experience:

On the hardware side, I recommend Apple M1 / M2 / M3 with at least 400 GB/s memory bandwidth. For local coding assistance this is perfect for 7B or 33B models.

We also run a Mac Studio (M2 Ultra, 192 GB RAM) with a bigger model (70B) as a chat server. It's pretty fast. There we use Open WebUI as the interface.

Software-wise, Ollama is OK, as most IDE plugins can work with it now. I personally don't like the Go code they have. Also, some key features I would need are missing and just never get done, even though multiple people submitted PRs for some of them.

LM Studio is better overall, both as a server and as a chat interface.

I can also recommend the CodeGPT plugin for JetBrains products and the Continue plugin for VS Code.

As a chat-server UI, as I mentioned, Open WebUI works great; I also use it with together.ai as a backend.
bababuriba · about 1 year ago
LM Studio has an easy interface, you can browse/search for models, it has a server if you need an API, etc.

It's pretty 'idiot proof', if you ask me.

https://lmstudio.ai
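For the server mode mentioned above, LM Studio exposes an OpenAI-compatible endpoint. A minimal sketch, assuming the default local address http://localhost:1234/v1 (check the app's Server tab for the actual one) and whatever model is currently loaded in the app:

    import json
    import urllib.request

    # Chat-completion request against LM Studio's local OpenAI-compatible server.
    payload = {
        "model": "local-model",  # LM Studio serves whichever model is loaded in the app
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])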
bluedino · about 1 year ago
Newbie questions:

What do you do with one of these?

Does it generate images? Write code? Can you ask it generic questions?

Do you have to 'train' it?

Do you need a large amount of storage to hold the data to train the model on?
jillesvangurp · about 1 year ago
Gpt4all and Simon Willison's llm Python tool are a nice way to get started, even on a modest laptop. A modest 14" MacBook Pro with 16GB goes a long way with most of the 7B models; anything larger and you need more RAM.
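For reference, a minimal sketch of the llm Python API with the gpt4all plugin (pip install llm llm-gpt4all); the model alias below is an assumption, and running `llm models` shows what is actually installed:

    import llm

    # Assumed model alias from the llm-gpt4all plugin; adjust to what `llm models` lists.
    model = llm.get_model("mistral-7b-instruct-v0")
    response = model.prompt("Summarize what a context window is in one sentence.")
    print(response.text())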
chasd00 · about 1 year ago
If you want a code-centric interface, it's pretty easy to use LangChain with local models. For example, you specify a model from Hugging Face and it will download and run it locally.

https://python.langchain.com/docs/get_started/introduction

I like LangChain, but it can get complex for use cases beyond a simple "give the LLM a string, get a string back". I've found myself spending more time in the LangChain docs than working on my actual idea/problem. However, it's still a very good framework and they've done an amazing job, IMO.

edit: "Are options ‘idiot proof’ yet?" - from my limited experience, Ollama is about as straightforward as it gets.
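A minimal sketch of that "specify a Hugging Face model and run it locally" flow, using LangChain's community integration (pip install langchain-community transformers torch accelerate); the model id is only an example:

    from langchain_community.llms import HuggingFacePipeline

    # Downloads the model from Hugging Face on first use and runs it locally.
    llm = HuggingFacePipeline.from_model_id(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",  # example model, not a recommendation
        task="text-generation",
        pipeline_kwargs={"max_new_tokens": 128},
    )
    print(llm.invoke("What is quantization, in one sentence?"))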
ein0p · about 1 year ago
FWIW I use Ollama and Open WebUI. Ollama uses one of my RTX A6000 GPUs. I also use Ollama on my M3 MacBook Pro. Open WebUI even supports multimodal models.
theshrike79 · about 1 year ago
I use LM Studio [0] on my M-series Macs and it's pretty plug and play.

I've got an Ollama instance running on a VPS providing a backend for a Discord bot.

[0] https://lmstudio.ai
teeray · about 1 year ago
Also, what are the recommended hardware options these days?
TachyonicBytes · about 1 year ago
I still feel that llamafiles [1] are the easiest way to do this, on most architectures. It's basically just running a binary with a few command-line options, which seems pretty close to what you describe wanting.

[1] https://github.com/Mozilla-Ocho/llamafile
notjulianjaynes · about 1 year ago
Jumping on this as a fellow idiot to ask for suggestions on having a local LLM generate a summary from a transcript with multiple speakers. How important is it that the transcript is well formatted (diarization, etc.) first? My attempts have failed miserably thus far.

Edit: using a P40, with Whisper as the ASR.
deathmonger5000 · about 1 year ago
I created a tool called Together Gift It because my family was sending an insane amount of gift-related group texts during the holidays. My favorite was when someone included the gift recipient in the group text about what gift we were getting for that person.

Together Gift It solves the problem the way you'd think: with AI. Just kidding. It solves the problem by keeping everything in one place. No more group texts. There are wish lists and everything you'd want around that type of thing. There is also AI.

https://www.togethergiftit.com/
ActorNightly · about 1 year ago
Ollama is the easiest. For coding, use the VS Code Continue extension and point it at the Ollama server.

The thing to watch out for (if you have disposable income) is the new RTX 5090. Rumors are floating around that it will have 48 GB of RAM per card; even if not, the RAM bandwidth is going to be a lot faster. People on 4090s or 3090s doing ML are going to move to those, so you can pick up a second 3090 for cheap, at which point you can load higher-parameter models. However, you will have to learn the Hugging Face Accelerate library to support multi-GPU inference (not hard, just some reading and trial and error).
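As a hedged sketch of the multi-GPU path mentioned above (Hugging Face transformers plus Accelerate, pip install transformers accelerate torch), device_map="auto" shards the weights across whatever GPUs Accelerate can see; the model id is only an example:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example; pick something that fits your VRAM
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",          # let Accelerate split layers across available GPUs
        torch_dtype=torch.float16,  # half precision to roughly halve memory vs. fp32
    )
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))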
mateuszbuda · about 1 year ago
Can anyone share their experience with https://ollama.com/ ?
nlittlepoole · about 1 year ago
https://jan.ai/ is pretty idiot proof.
0xbadc0de5 · about 1 year ago
https://lmstudio.ai/ + Mixtral 8x7B
borissk · about 1 year ago
Interesting question; I'd like to know this also.

I'm guessing it's going to be a variant of Llama or Grok.
whimsicalism · about 1 year ago
It depends on whether you want ease or speed, and whether you are batching.

Ease? Probably Ollama.

Speed, and you are batching on GPU? vLLM.
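For the batched-on-GPU case, a minimal sketch of vLLM's offline inference API (pip install vllm); the model id is an example and is downloaded from Hugging Face on first run:

    from vllm import LLM, SamplingParams

    prompts = [
        "Explain KV caching in one sentence.",
        "What is speculative decoding?",
    ]
    sampling = SamplingParams(temperature=0.7, max_tokens=64)
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
    # generate() batches the prompts and returns one output per prompt
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)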
EarthAmbassador · about 1 year ago
Where can we find detailed info, or better yet a tool, to determine gear compatibility?
jsight · about 1 year ago
Ollama is the easiest, IMO. The CLI interface is pretty good too.

gpt4all is decent as well, and also provides a way to retrieve information from local documents.
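A minimal sketch of the GPT4All Python bindings (pip install gpt4all); the model file name below is an assumption, and the library downloads it on first use:

    from gpt4all import GPT4All

    # Assumed model file name; any model from the GPT4All catalog works here.
    model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
    with model.chat_session():
        print(model.generate("What is a token in an LLM?", max_tokens=128))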
LorenDB · about 1 year ago
Ollama is about as idiot proof as you'll get.
resource_waste · about 1 year ago
oobabooga + Berkeley Starling LM

Seriously, this is the insane duo that can get you going in moments with ChatGPT-3.5 quality.
tamarlikesdata · about 1 year ago
Hugging Face Transformers is your best bet. It's pretty straightforward and has solid docs, but you'll need to get your hands dirty a bit with setup and configs.

For squeezing every bit of performance out of your GPU, check out ONNX or TensorRT. They're not exactly plug-and-play, but they're getting easier to use.

And yeah, Docker can make life a bit easier by handling most of the setup mess for you. Just pull a container and you're more or less good to go.

It's not quite "idiot-proof" yet, but it's getting there. Just be ready to troubleshoot and tinker a bit.
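A minimal sketch of the plain Transformers route on a single consumer GPU (pip install transformers torch accelerate); the model id is only an example:

    import torch
    from transformers import pipeline

    # Downloads the model from the Hugging Face Hub on first run.
    generator = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",  # example model
        torch_dtype=torch.float16,  # half precision so a 7B model fits comfortably in 24 GB
        device_map="auto",          # place the model on the available GPU
    )
    print(generator("The easiest way to run an LLM locally is", max_new_tokens=64)[0]["generated_text"])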
database-theory · about 1 year ago
Paywalled article: https://towardsdatascience.com/how-to-build-a-local-open-source-llm-chatbot-with-rag-f01f73e2a131

Source code: https://github.com/leoneversberg/llm-chatbot-rag
benreesman · about 1 year ago
The current HYPERN // MODERN // AI builds are using flox 1.0.2 to install llama.cpp. The default local model is dolphin-8x7b at 'Q4_KM' quantization (it lacks defaults for Linux/NVIDIA; that's coming soon, and it works, you just have to configure it manually; Mac gets more love because that's what myself and the other main contributors have).

flox will also install properly accelerated torch/transformers/sentence-transformers/diffusers/etc.: they were kind enough to give me a preview of their soon-to-be-released SDXL environment suite (please don't hold them to my "soon", I just know it looks close to me). So you can do all the modern image stuff, pretty much up to whatever is on Hugging Face.

I don't have the time I need to be emphasizing this, but the last piece before I open-source it is a halfway-decent sketch of a binary replacement/complement for the OpenAI-compatible JSON/HTTP protocol everyone is using now.

I have incomplete bindings to whisper.cpp and llama.cpp for those modalities, and when it's good enough I hope the buf.build people will accept it as a donation to the community-managed ConnectRPC project suite.

We're really close to a plausible shot at open standards on this before NVIDIA or someone totally locks down the protocol via the RT stuff.

edit: I almost forgot to mention: we have decent support for multi-vendor, mostly in practice courtesy of the excellent 'gptel', though both nvim and VSCode are planned for out-of-the-box support too.

The gap is opening up a bit again between the best closed and best open models.

This is speculation, but I strongly believe the current Opus API-accessible build is more than a point release; it's a fundamental capability increase (though it has a weird BPE truncation issue that could just be a beta bug, but it could hint at something deeper).

It can produce verbatim artifacts from hundreds of thousands of tokens ago and restart from any branch in the context, takes dramatically longer when it needs to go deep, and claims it's accessing a sophisticated memory hierarchy. Personally, I've never been slack-jawed with amazement at anything in AI except my first night with SD and this thing.