For my fellow Windows shills, here's how you actually build it on Windows:<p>Before you start:<p>1. (For Nvidia GPU users) Install the CUDA toolkit: <a href="https://developer.nvidia.com/cuda-downloads" rel="nofollow noreferrer">https://developer.nvidia.com/cuda-downloads</a><p>2. Download the model somewhere: <a href="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin" rel="nofollow noreferrer">https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...</a><p>In Windows Terminal with PowerShell:<p><pre><code> git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
cd bin/Release
mkdir models
mv Folder\Where\You\Downloaded\The\Model .\models
.\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
</code></pre>
`-DLLAMA_CUBLAS=ON` builds with CUDA (cuBLAS) GPU support<p>`2> $null` redirects the debug messages printed to stderr to $null so they don't spam your terminal<p>Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:<p><pre><code> function llama {
    .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p "$args" 2> $null
}
</code></pre>
Adjust your paths as necessary. It has a tendency to talk to itself.
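The self-talk usually improves if you wrap your prompt in the chat template the Llama-2-chat weights were trained on ([INST]/<<SYS>> markers). A minimal sketch of that template in Python; the system message here is just an example, not anything from the original post:<p><pre><code> # Build a single-turn Llama 2 chat prompt (the tokenizer normally adds the BOS token itself).
def llama2_chat_prompt(user_msg, system_msg="You are a helpful assistant."):
    return (
        "[INST] <<SYS>>\n"
        f"{system_msg}\n"
        "<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

print(llama2_chat_prompt("Hello, how are you, llama?"))
</code></pre>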
Some of you may have seen this already, but I have a Llama 2 fine-tuning live-coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.<p>In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU.<p>Check it out here if you're interested: <a href="https://www.youtube.com/watch?v=TYgtG2Th6fI">https://www.youtube.com/watch?v=TYgtG2Th6fI</a>
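Not the exact code from the stream, but the core idea is: load the base model in 4-bit with bitsandbytes and attach LoRA adapters with PEFT so only a small fraction of the weights get trained. The model id and LoRA hyperparameters below are illustrative:<p><pre><code> # Sketch of 4-bit loading + LoRA with PEFT (illustrative hyperparameters).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # requires access to the gated HF weights

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA weights are trainable
</code></pre>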
This covers three things:
Llama.cpp (Mac/Windows/Linux),
Ollama (Mac),
MLC LLM (iOS/Android)<p>Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (<a href="https://github.com/huggingface/text-generation-inference">https://github.com/huggingface/text-generation-inference</a>). And I am sure there are other things that could be covered.
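For reference, once a text-generation-inference server is running it exposes a plain HTTP API; querying it looks roughly like this (the localhost URL and sampling parameters are assumptions for the example):<p><pre><code> # Query a running text-generation-inference server (assumed to listen on localhost:8080).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
</code></pre>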
Self-plug. Here's a fork of the original Llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:<p><a href="https://github.com/krychu/llama">https://github.com/krychu/llama</a><p>It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
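The CPU/MPS selection is essentially the standard PyTorch availability check; a minimal sketch of that fallback (not copied from the fork):<p><pre><code> # Use the Apple Silicon GPU via MPS when available, otherwise fall back to CPU.
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

x = torch.randn(4, 4, device=device)  # model and tensors then live on that device
print(device, x.sum().item())
</code></pre>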
The easiest way I found was to use GPT4All. Just download and install, grab a GGML version of Llama 2, copy it to the models directory in the installation folder, then fire up GPT4All and run.
For most people who just want to play around and are using macOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: <a href="https://github.com/oobabooga/text-generation-webui">https://github.com/oobabooga/text-generation-webui</a>
I don't remember if grammar support has been merged into llama.cpp yet, but it would be the first step toward running Llama + Stable Diffusion locally so they can output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images coming back. At least it could use them, though, to build e.g. a webpage.
There seems to be a better guide here (without the risky curl one-liner):<p><a href="https://www.stacklok.com/post/exploring-llama-2-on-a-apple-mac-m1-m2" rel="nofollow noreferrer">https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...</a>
The LLM is impressive (llama2:13b) but appears to be heavily restricted in what you are allowed to do with it.<p>I tried to get it to generate a JSON object about the movie The Matrix and the model refused.
I might be missing something. The article asks me to run a bash script on Windows.<p>I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?<p>I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super affected by delay, so lower token rates might work too.
Maybe obvious to others, but the one-line install command with curl is taking a long time. It must be the build step. Probably 40+ minutes now on an M2 Max.
Self plug: run llama.cpp as an inference server on a spot instance anywhere: <a href="https://cedana.readthedocs.io/en/latest/examples.html#running-llama-cpp-inference" rel="nofollow noreferrer">https://cedana.readthedocs.io/en/latest/examples.html#runnin...</a>
How do you decide which model variant to use? There are a bunch of quantization-method variants of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.<p>[0] <a href="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML" rel="nofollow noreferrer">https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML</a>
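One practical way to narrow it down is memory: the name roughly encodes bits per weight, and file size ≈ parameters × bits / 8, so you can check what fits in your RAM/VRAM before worrying about quality differences. A back-of-the-envelope sketch (the bits-per-weight figures are approximate, since block scales add a little overhead):<p><pre><code> # Rough size estimate per quantization level for a 13B-parameter model.
params = 13e9
approx_bits_per_weight = {"q4_0": 4.5, "q5_0": 5.5, "q8_0": 8.5, "f16": 16.0}

for name, bits in approx_bits_per_weight.items():
    print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")

# q4_0 comes out around 7.3 GB; higher-bit quants trade more memory
# for a smaller quality (perplexity) loss.
</code></pre>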
If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.<p>0. <a href="https://github.com/huggingface/blog/blob/main/llama2.md#using-transformers">https://github.com/huggingface/blog/blob/main/llama2.md#usin...</a>
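For reference, the linked section boils down to something along these lines (the model id and sampling parameters here are just an example; you still need access to the gated weights):<p><pre><code> # Minimal text generation with transformers; 7B in fp16 fits in ~16 GB of VRAM.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

out = pipe(
    "Explain what a llama is in one sentence.",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
)
print(out[0]["generated_text"])
</code></pre>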
Idiot question: if I have access to sentence-by-sentence professionally-translated text of foreign-language-to-English in gigantic quantities, and I fed the originals as prompts and the translations as completions...<p>... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
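For what it's worth, that kind of parallel data is usually formatted as one prompt/completion record per line before fine-tuning; a hypothetical example of the formatting (field names and the sample sentence are made up):<p><pre><code> # Hypothetical JSONL formatting of sentence-aligned translation pairs.
import json

pairs = [
    ("Der Hund schläft unter dem Tisch.", "The dog is sleeping under the table."),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for source, target in pairs:
        record = {
            "prompt": f"Translate to English:\n{source}\n",
            "completion": target,
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
</code></pre>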
Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
I need some hand-holding: I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material? Is this even feasible?