Neat to see more folks writing blogs on their experiences. This does, however, seem like an over-complicated method of building llama.cpp.<p>Assuming you want to do this interactively (at least for the first time), you should only need to run:<p><pre><code> ccmake .
</code></pre>
And toggle the parameters your hardware supports or that you want (e.g. CUDA if you're using Nvidia, Metal if you're using Apple, etc.), then press 'c' (configure) and 'g' (generate), then:<p><pre><code> cmake --build . -j $(expr $(nproc) / 2)
</code></pre>
Done.<p>If you want to move the binaries into your PATH, you could then optionally run the install step (cmake --install .).
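For a non-interactive variant of the same flow (the GGML_CUDA flag is an assumption for a recent checkout on Nvidia hardware; swap in whichever backend option ccmake shows for your setup):<p><pre><code> # configure non-interactively; GGML_CUDA here is only an example backend flag
cmake . -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
# build with half the available cores, as above
cmake --build . -j $(expr $(nproc) / 2)
# optionally install the binaries into your PATH
sudo cmake --install .
</code></pre>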
The first time I heard about llama.cpp I got it to run on my computer. Now, my computer: a Dell laptop from 2013 with 8 GB RAM and an i5 processor, no dedicated graphics card. Since I wasn't using an MGLRU-enabled kernel, it took a looong time to start, but it wasn't OOM-killed. Considering my amount of RAM was just the minimum required, I tried one of the smallest available models.<p>Impressively, it worked. It was slow to spit out tokens, at a rate of around a word every 1 to 5 seconds, and it was able to correctly answer "What was the biggest planet in the solar system", but it quickly hallucinated, talking about moons that it called "Jupterians", while I expected it to talk about the Galilean moons.<p>Nevertheless, LLMs really impressed me, and as soon as I get my hands on better hardware I'll try to run other, bigger models locally in the hope that I'll finally have a personal "oracle" able to quickly answer most questions I throw at it and help me write code and other fun things. Of course, I'll have to check its answers before using them, but the current state seems impressive enough for me, especially QwQ.<p>Is anyone running smaller experiments who can talk about their results? Is it already possible to have something like an open source co-pilot running locally?
Llama.cpp is one of those projects that I <i>want</i> to install, but I always just wind up installing kobold.cpp because it's simply <i>miles</i> better with UX.
I'd say avoid pulling in all the Python and containers required and just download the GGUF from the Hugging Face website directly in a browser rather than doing it programmatically. That sidesteps a lot of this project's complexity, since nothing about llama.cpp requires those heavy deps or abstractions.
I tried building and using llama.cpp multiple times, and after a while, I got so frustrated with the frequently broken build process that I switched to ollama with the following script:<p><pre><code> #!/bin/sh
export OLLAMA_MODELS="/mnt/ai-models/ollama/"
printf 'Starting the server now.\n'
ollama serve >/dev/null 2>&1 &
serverPid="$!"
printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
ollama run llama3.2 2>/dev/null
printf 'Stopping the server now.\n'
kill "$serverPid"
</code></pre>
And it just works :-)
Seeing a lot of Ollama vs. running llama.cpp directly talk here. I agree that setting up llama.cpp with CUDA isn't always the easiest. But there is a cost to running all inference over a local HTTP server: in-process inference will be faster. Perhaps that doesn't matter in some cases, but it's worth noting.<p>I find PyTorch easier to get up and running. For quantization, AWQ models work and support is just a "pip install" away.
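For context, the PyTorch route I mean is roughly the following; the package names are assumptions rather than a vetted recipe:<p><pre><code> # rough sketch: a venv plus PyTorch and AWQ support
python3 -m venv llm-env && . llm-env/bin/activate
pip install torch transformers autoawq
</code></pre>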
FYI, if you're on Ubuntu 24.04, it's easy to build llama.cpp with AMD ROCm GPU acceleration. Debian enabled support for a wider variety of hardware than is available in the official AMD packages, so this should work for nearly all discrete AMD GPUs from Vega onward (with the exception of MI300, because Ubuntu 24.04 shipped with ROCm 5):<p><pre><code> sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential
# add yourself to the video and render groups
sudo usermod -aG video,render $USER
# reboot to apply the group changes
# download a model
wget --continue -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
  https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true
# build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b3267
HIPCXX=clang++-17 cmake -S. -Bbuild \
  -DGGML_HIPBLAS=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" \
  -DCMAKE_BUILD_TYPE=Release
make -j8 -C build
# run llama.cpp
build/bin/llama-cli -ngl 32 --color -c 2048 \
  --temp 0.7 --repeat_penalty 1.1 -n -1 \
  -m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
  --prompt "Once upon a time"
</code></pre>
I think this will also work on Rembrandt, Renoir, and Cezanne integrated GPUs with Linux 6.10 or newer, so you might be able to install the HWE kernel to get it working on that hardware.<p>With that said, users with CDNA 2 or RDNA 3 GPUs should probably use the official AMD ROCm packages instead of the built-in Ubuntu packages, as there are performance improvements for those architectures in newer versions of rocBLAS.
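One extra sanity check worth doing before the build (rocminfo is a separate package, so treat the exact names as assumptions): confirm the driver can actually see your GPU and note its gfx target, since that's what needs to be covered by CMAKE_HIP_ARCHITECTURES above.<p><pre><code> # sketch: check that the GPU is visible and find its gfx target
sudo apt -y install rocminfo
rocminfo | grep -i gfx
</code></pre>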
What are the limitations on which LLMs (specific transformer variants, etc.) llama.cpp can run? Does it require the input model/weights to be in some self-describing format like ONNX that supports different model architectures as long as they are built out of specific module/layer types, or does it more narrowly only support transformer models parameterized by depth, width, etc.?
This was nice. I took the road less traveled and tried building on Windows and AMD.<p>Spoiler: Vulkan with MSYS2 was indeed the easiest to get up and running.<p>I actually tried w64devkit first and it worked properly for llama-server, but there were inexplicable plug-in problems with llama-bench.<p>Edit: I tried w64devkit before I read this write-up and I was left wondering what to try next, so the timing was perfect.
Somewhat related - on several occasions I've come across the claim that <i>"Ollama is just a llama.cpp wrapper"</i>, which is inaccurate and completely misses the point. I am sharing my response here to avoid repeating myself.<p>With llama.cpp running on a machine, how do you connect your LLM clients to it and request that a model be loaded with a given set of parameters and templates?<p>... you can't, because llama.cpp is the inference engine - and its bundled llama-server binary only provides relatively basic server functionality - it's really more of a demo/example or MVP.<p>Llama.cpp is all configured at the time you run the binary, by manually providing command line args for the one specific model and configuration you start it with.<p>Ollama provides a server and client for interfacing and packaging models, such as:<p><pre><code> - Hot loading models (e.g. when you request a model from your client, Ollama will load it on demand).
- Automatic model parallelisation.
- Automatic model concurrency.
- Automatic memory calculations for layer and GPU/CPU placement.
- Layered model configuration (basically docker images for models).
- Templating and distribution of model parameters and templates in a container-style image.
- Near feature-complete OpenAI-compatible API, as well as its native API that supports more advanced features such as model hot loading, context management, etc...
- Native libraries for common languages.
- Official container images for hosting.
- Provides a client/server model for running remote or local inference servers with either Ollama or openai compatible clients.
- Support for both official and self-hosted model and template repositories.
- Support for multi-modal / Vision LLMs - something that llama.cpp is not focusing on providing currently.
- Support for serving safetensors models, as well as running and creating models directly from their Huggingface model ID.
</code></pre>
In addition to the llama.cpp engine, Ollama is working on adding other model backends (e.g. exl2, awq, etc.).<p>Ollama is not "better" or "worse" than llama.cpp, because it's an entirely different tool.
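To make the client/server point concrete, here's roughly what talking to a running Ollama instance looks like (llama3.2 is just an example model name; both endpoints are part of Ollama's API):<p><pre><code> # native API: Ollama loads the requested model on demand (assuming it has been pulled)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'
# OpenAI-compatible endpoint, usable by existing OpenAI clients
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
</code></pre>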
I set up llama.cpp last week on my M3. Was fairly simple via homebrew. However, I get tags like <|im_start|> in the output constantly. Is there a way to filter them out with llama-server? Seems like a major usability issue if you want to use llama.cpp by itself (with the web interface).<p>ollama didn’t have the issue, but it’s less configurable.
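Edit: for anyone else hitting this, llama-server has a --chat-template option that seems to be the relevant knob, telling the server how to wrap messages so special tokens get consumed instead of printed. Whether "chatml" is the right template depends on the model, so treat this as a sketch rather than a verified fix:<p><pre><code> # sketch: explicitly pick a chat template; chatml is only an example
llama-server -m model.gguf --chat-template chatml
</code></pre>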
I just gave this a shot on my laptop and it works reasonably well considering it has no discrete GPU.<p>One thing I’m unsure of is how to pick a model. I downloaded the 7B one from Huggingface, but how is anyone supposed to know what these models are for, or if they’re any good?
I use ChatGPT and Claude daily, but I can't see a use case for why I would use an LLM outside of these services.<p>What do you use Llama.cpp for?<p>I get that you can ask it a question in natural language and it will spit out a sort of answer, but what would you do with it? What do you ask it?
re the Temperature config option: I've found it useful for trying to generate something akin to a sampling-based confidence score for chat completions (e.g., set the temperature a bit high, run the model a few times, and calculate the distribution of responses). Otherwise I haven't figured out a good way to get confidence scores in llama.cpp (I've been tracking this issue about exposing log_probs: <a href="https://github.com/ggerganov/llama.cpp/issues/6423">https://github.com/ggerganov/llama.cpp/issues/6423</a>)
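A crude way to run that sampling loop from the shell with llama-cli (the model path, prompt, and the idea of just counting how often answers agree are placeholders, not a worked-out pipeline):<p><pre><code> # sketch: sample the same prompt at a higher temperature with different seeds,
# then check how often the answers agree as a rough confidence signal
for seed in 1 2 3 4 5; do
  build/bin/llama-cli -m model.gguf --temp 1.0 --seed "$seed" -n 32 \
    -p "Answer in one word: what is the largest planet?" 2>/dev/null
done
</code></pre>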