Not super knowledgeable about all the different specs of the various Orange Pi and Raspberry Pi models. I'm looking for something relatively cheap that can connect to WiFi and USB. I want to be able to run at least 13B models at a decent tok/s.<p>Also open to other solutions. I have a Mac M1 (8GB RAM), and upgrading the computer itself would be cost prohibitive for me.
Back in April I bought some parts to build a PC for testing LLMs with llama.cpp. I paid around $192 for: a B550MH motherboard, an AMD Ryzen 3 4100, 1x16GB DDR4 Kingston ValueRAM, and a 256GB M.2 SSD. I already had an old PC case with a 350W PSU.<p>I was getting 2.2 tokens/s with llama-2-13b-chat.Q4_K_M.gguf and 3.3 tokens/s with llama-2-13b-chat.Q3_K_S.gguf. With Mistral and Zephyr, the Q4_K_M versions, I was getting 4.4 tokens/s.<p>A few days ago I bought another stick of 16GB RAM ($30) and the inference speed doubled (presumably because the second stick enables dual-channel memory, and CPU inference is mostly memory-bandwidth bound). So now I'm getting 6.5 tokens/s with llama-2-13b-chat.Q3_K_S.gguf, which for my needs gives the same results as Q4_K_M, and 9.1 tokens/s with Mistral and Zephyr. Personally, I can barely keep up with reading at 9 tokens/s (if I also have to process the text and check for errors).<p>If I weren't considering getting an Nvidia 4060 Ti for Stable Diffusion, I would seriously be considering a used RX 580 8GB ($75) and running Llama Q4_K_M entirely on the GPU, or offloading some layers when using a 30B model.
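For anyone who wants to try the layer-offload approach mentioned above, here is a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp); the model path and the number of offloaded layers are placeholders you would adjust to your own files and VRAM:

    # Minimal partial-offload sketch with llama-cpp-python.
    # n_gpu_layers controls how many transformer layers go to the GPU;
    # the rest stay on the CPU. Path and layer count are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-13b-chat.Q4_K_M.gguf",  # any local GGUF file
        n_ctx=2048,
        n_gpu_layers=32,  # lower this if the layers don't fit in VRAM
    )

    out = llm("Q: What is a good budget GPU for local LLMs? A:", max_tokens=128)
    print(out["choices"][0]["text"])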
For a 13B model, depending on what quantization you choose, you’re going to need a system with at least 16GB RAM, and even my AMD Ryzen 7 5800X at full throttle feels a bit sluggish on 7B Llama - a Raspberry Pi or similar would be painful.<p>Here is the best explanation I’ve found so far, covering various trade-offs and scenarios: <a href="https://www.hardware-corner.net/guides/computer-to-run-llama-ai-model/" rel="nofollow noreferrer">https://www.hardware-corner.net/guides/computer-to-run-llama...</a><p>In your shoes, not being in the position to spend much right now, I’d try a few different 7B models at 4 and 5 bit quantizations on the Mac, which is going to be better than just about any other 8GB RAM system, and look into using cloud for larger stuff (remember to fully deallocate the VM when done for the day!)
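To see roughly where the 16GB figure comes from, here is a back-of-envelope sketch (numbers are approximate; real GGUF files carry metadata overhead, and you still need headroom for the KV cache and the OS):

    # Rough RAM estimate for the quantized weights alone.
    def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13):
        for bpw in (4.5, 5.5):  # roughly Q4_K_M / Q5_K_M
            print(f"{params}B @ ~{bpw} bits: ~{weights_gb(params, bpw):.1f} GB")

    # 13B at ~4.5 bits is already ~7.3 GB of weights, which is why an 8GB
    # machine is tight and 16GB of system RAM is the comfortable floor.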
Unfortunately a 13B model quantized down to even 2 or 3 bits won't fit into 8GB of RAM. You might have to settle for 7B models. I'm currently getting great results out of the Mistral-OpenOrca-7B model. At 5 bits it's using about 5.5GB of RAM and running at 14 tokens/sec on my M2 MacBook Air (24GB RAM, but in theory it would work on 8GB if macOS will allow you to allocate that much RAM to the GPU). As a quick test, forcing it into CPU mode, I'm still getting 11 tokens/sec. It does seem to take about 2-3 times longer to initialize itself and get into the "ready state" when using the CPU, however.
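If you want to reproduce that GPU-vs-CPU comparison yourself, here is a hedged sketch with llama-cpp-python; the GGUF filename is a placeholder for whichever Mistral-OpenOrca quant you downloaded, n_gpu_layers=-1 offloads everything to Metal, and 0 keeps it on the CPU:

    import time
    from llama_cpp import Llama

    def bench(n_gpu_layers: int) -> float:
        # Load the model either fully offloaded to the GPU or CPU-only,
        # generate a short completion, and report tokens per second.
        llm = Llama(model_path="./mistral-7b-openorca.Q5_K_M.gguf",
                    n_gpu_layers=n_gpu_layers, n_ctx=2048, verbose=False)
        t0 = time.time()
        out = llm("Explain what quantization does to an LLM.", max_tokens=128)
        return out["usage"]["completion_tokens"] / (time.time() - t0)

    print("GPU (Metal):", round(bench(-1), 1), "tok/s")
    print("CPU only:   ", round(bench(0), 1), "tok/s")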
Clone the llama.cpp project from GitHub. Download LLM models from HuggingFace; the user TheBloke posts a lot of models in GGUF format. A GPU is not needed to run these. 13B models should be reasonably fast on a modern CPU. llama.cpp offers a batch mode, an interactive chat mode, and also a web server mode.
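If you would rather stay in Python than drive the llama.cpp binaries directly, the same workflow can be sketched with huggingface_hub and llama-cpp-python; the repo and file names below are examples from TheBloke's page and may change:

    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    # Download one quantized GGUF file from TheBloke's HuggingFace repo.
    path = hf_hub_download(
        repo_id="TheBloke/Llama-2-13B-chat-GGUF",
        filename="llama-2-13b-chat.Q4_K_M.gguf",
    )

    # Run it on the CPU; no GPU required.
    llm = Llama(model_path=path, n_ctx=2048)
    print(llm("Q: What is GGUF? A:", max_tokens=100)["choices"][0]["text"])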
Here’s an M1 guide I found.
<a href="https://simonwillison.net/2023/Aug/1/llama-2-mac/" rel="nofollow noreferrer">https://simonwillison.net/2023/Aug/1/llama-2-mac/</a><p>Start with a 7B model then go from there.
I used KoboldAI, but that didn't seem to be too well recommended for macOS.<p>A Raspberry Pi 4B can do 3B models, or 7B at roughly one question per hour, for now. You can quantize them to run faster, but then the answers are worse.
Depends what you mean by "local". If you mean in your own home, then there isn't a particularly cheap way unless you have a decent spare machine. If you mean "I get to control everything myself", then you can rent a cheap VPS from a value host like Contabo (you can get 8 cores, 30GB of RAM, and a 1TB SSD on Ubuntu 22.04 for something like $35/month; just stick to the US data centers).<p>Then if you want something that is extremely quick and easy to set up and provides a convenient REST API for completions/embeddings with some other nice features, you might want to check out my project here:<p><a href="https://github.com/Dicklesworthstone/swiss_army_llama">https://github.com/Dicklesworthstone/swiss_army_llama</a><p>Especially if you use Docker to set it up, you can go from a brand new box to a working setup in under 20 minutes and then access it via the Swagger page from any browser.
You need lots of memory. 16GB will easily run 7B quantized models.<p>Not the cheapest by far, but I recently bought an M2 Pro Mac mini with 32GB of internal memory. I can run about four 7B models concurrently. I was able to run a 30B quantized model without page faults, but I killed most user-land processes.<p>Also not what you are asking for, but I pay Google $10/month for Colab Pro and I can usually get an A100 whenever I request one. Between Colab and my 32GB M2 box, I am very satisfied. Before I found good quantized models to run, I would rent a VPS by the hour from Lambda Labs, and that was a great experience, but I don’t need to do that now.<p>EDIT: on the M2 Pro, I get 25 to 30 tokens per second.<p>EDIT #2: I wrote a short blog post yesterday on the best resources I have found so far for running on my Mac <a href="https://marklwatson.substack.com/p/running-open-llm-models-on-apple" rel="nofollow noreferrer">https://marklwatson.substack.com/p/running-open-llm-models-o...</a>
Hm, it is unclear to me if you plan to use some Pis or your Mac M1.<p>In case it's the latter, I recently used Ollama[1] and boy was it good! Installation was a breeze, downloading/using models is very easy, and performance on my M1 was quite good for the Mistral 7B model.<p>1: <a href="https://ollama.ai/">https://ollama.ai/</a>
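For anyone going the Ollama route on the M1: besides the interactive CLI, it also exposes a local REST API (on port 11434 by default), so you can script it. A small sketch, assuming you have already pulled the model with "ollama pull mistral":

    import requests

    # Ask the locally running Ollama server for a completion from Mistral 7B.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
        timeout=300,
    )
    print(resp.json()["response"])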
What is a decent tok/s?<p>Your best bet is to run a quantized 7B model using LM Studio or Ollama on your M1 Mac, like Intel's Neural Chat v3.1 (a fine-tune of Mistral 7B).
Quick question: does anyone know if a cluster of six Jetson Tegra K1 boards could run a 12B or 20B model? I haven't been able to get consistent information regarding the AI capability of the boards.
Silly question: I'm thinking of dipping my toes into local LLMs.<p>Let's say I point my resources at getting one up and running that outputs tokens at an acceptable rate; then what? What can I do with a local LLM?
I've got an orca-3b GGML (koboldcpp) running on an RPi 4 and it sucks. It takes a few <i>minutes</i> just to process the prompt, then it's 1 token per second of output.<p>...which is usually crap (because it's only 3B) and needs to be regenerated anyway. It's not a viable solution for any generative use case. Mechanical Turk is faster and more reliable.<p>There are smaller models that I could try, but 7B is already the lower limit of my patience. YMMV