For my fellow Windows shills, here's how you actually build it on Windows:<p>Before you start:<p>1. (For Nvidia GPU users) Install the CUDA toolkit: <a href="https://developer.nvidia.com/cuda-downloads" rel="nofollow noreferrer">https://developer.nvidia.com/cuda-downloads</a><p>2. Download the model somewhere: <a href="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin" rel="nofollow noreferrer">https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...</a><p>In Windows Terminal with PowerShell:<p><pre><code> git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
cd bin/Release
mkdir models
mv Folder\Where\You\Downloaded\The\Model .\models
.\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
</code></pre>
`-DLLAMA_CUBLAS=ON` builds with CUDA (cuBLAS) GPU support<p>`2> $null` redirects the debug messages printed to stderr to $null so they don't spam your terminal<p>Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:<p><pre><code> function llama {
    .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p "$args" 2> $null
}
</code></pre>
Adjust your paths as necessary. It has a tendency to talk to itself.
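The self-talk usually improves if you wrap your prompt in the chat template the Llama-2-chat weights were trained on ([INST]/<<SYS>> markers). A minimal sketch of that template in Python; the system message here is just an example, not anything from the original post:<p><pre><code> # Build a single-turn Llama 2 chat prompt (the tokenizer normally adds the BOS token itself).
def llama2_chat_prompt(user_msg, system_msg="You are a helpful assistant."):
    return (
        "[INST] <<SYS>>\n"
        f"{system_msg}\n"
        "<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

print(llama2_chat_prompt("Hello, how are you, llama?"))
</code></pre>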
Some of you may have seen this already, but I have a Llama 2 fine-tuning live-coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.<p>In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU.<p>Check it out here if you're interested: <a href="https://www.youtube.com/watch?v=TYgtG2Th6fI">https://www.youtube.com/watch?v=TYgtG2Th6fI</a>
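Not the exact code from the stream, but the core idea is: load the base model in 4-bit with bitsandbytes and attach LoRA adapters with PEFT so only a small fraction of the weights get trained. The model id and LoRA hyperparameters below are illustrative:<p><pre><code> # Sketch of 4-bit loading + LoRA with PEFT (illustrative hyperparameters).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # requires access to the gated HF weights

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA weights are trainable
</code></pre>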
This covers three things:
Llama.cpp (Mac/Windows/Linux),
Ollama (Mac),
MLC LLM (iOS/Android)<p>Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (<a href="https://github.com/huggingface/text-generation-inference">https://github.com/huggingface/text-generation-inference</a>). And I am sure there are other things that could be covered.
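For reference, once a text-generation-inference server is running it exposes a plain HTTP API; querying it looks roughly like this (the localhost URL and sampling parameters are assumptions for the example):<p><pre><code> # Query a running text-generation-inference server (assumed to listen on localhost:8080).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
</code></pre>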
Self-plug. Here's a fork of the original Llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:<p><a href="https://github.com/krychu/llama">https://github.com/krychu/llama</a><p>It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
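The CPU/MPS selection is essentially the standard PyTorch availability check; a minimal sketch of that fallback (not copied from the fork):<p><pre><code> # Use the Apple Silicon GPU via MPS when available, otherwise fall back to CPU.
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

x = torch.randn(4, 4, device=device)  # model and tensors then live on that device
print(device, x.sum().item())
</code></pre>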
The easiest way I found was to use GPT4All. Just download and install, grab a GGML version of Llama 2, copy it to the models directory in the installation folder, then fire up GPT4All and run.
For most people who just want to play around and are using macOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: <a href="https://github.com/oobabooga/text-generation-webui">https://github.com/oobabooga/text-generation-webui</a>
I don't remember if grammar support has been merged into llama.cpp yet, but it would be the first step toward running Llama + Stable Diffusion locally so they can output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images coming back. At least it could use them, though, to build e.g. a webpage.
There seems to be a better guide here (without the risky curl one-liner):<p><a href="https://www.stacklok.com/post/exploring-llama-2-on-a-apple-mac-m1-m2" rel="nofollow noreferrer">https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...</a>
The LLM is impressive (llama2:13b) but appears to be heavily restricted in what you are allowed to do with it.<p>I tried to get it to generate a JSON object about the movie The Matrix and the model refused.
I might be missing something. The article asks me to run a bash script on Windows.<p>I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?<p>I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super affected by delay, so lower token rates might work too.
Maybe obvious to others, but the one-line install command with curl is taking a long time. It must be the build step. Probably 40+ minutes now on an M2 Max.
Self plug: run llama.cpp as an inference server on a spot instance anywhere: <a href="https://cedana.readthedocs.io/en/latest/examples.html#running-llama-cpp-inference" rel="nofollow noreferrer">https://cedana.readthedocs.io/en/latest/examples.html#runnin...</a>
How do you decide which model variant to use? There are a bunch of quantization-method variants of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.<p>[0] <a href="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML" rel="nofollow noreferrer">https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML</a>
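One practical way to narrow it down is memory: the name roughly encodes bits per weight, and file size ≈ parameters × bits / 8, so you can check what fits in your RAM/VRAM before worrying about quality differences. A back-of-the-envelope sketch (the bits-per-weight figures are approximate, since block scales add a little overhead):<p><pre><code> # Rough size estimate per quantization level for a 13B-parameter model.
params = 13e9
approx_bits_per_weight = {"q4_0": 4.5, "q5_0": 5.5, "q8_0": 8.5, "f16": 16.0}

for name, bits in approx_bits_per_weight.items():
    print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")

# q4_0 comes out around 7.3 GB; higher-bit quants trade more memory
# for a smaller quality (perplexity) loss.
</code></pre>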
If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.<p>0. <a href="https://github.com/huggingface/blog/blob/main/llama2.md#using-transformers">https://github.com/huggingface/blog/blob/main/llama2.md#usin...</a>
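For reference, the linked section boils down to something along these lines (the model id and sampling parameters here are just an example; you still need access to the gated weights):<p><pre><code> # Minimal text generation with transformers; 7B in fp16 fits in ~16 GB of VRAM.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

out = pipe(
    "Explain what a llama is in one sentence.",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
)
print(out[0]["generated_text"])
</code></pre>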
Idiot question: if I have access to sentence-by-sentence professionally-translated text of foreign-language-to-English in gigantic quantities, and I fed the originals as prompts and the translations as completions...<p>... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
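For what it's worth, that kind of parallel data is usually formatted as one prompt/completion record per line before fine-tuning; a hypothetical example of the formatting (field names and the sample sentence are made up):<p><pre><code> # Hypothetical JSONL formatting of sentence-aligned translation pairs.
import json

pairs = [
    ("Der Hund schläft unter dem Tisch.", "The dog is sleeping under the table."),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for source, target in pairs:
        record = {
            "prompt": f"Translate to English:\n{source}\n",
            "completion": target,
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
</code></pre>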
Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
I need some hand-holding: I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material? Is this even feasible?