
Guide to running Llama 2 locally

683 points by bfirsh almost 2 years ago

29 comments

shortrounddev2 almost 2 years ago
For my fellow Windows shills, here's how you actually build it on Windows:

Before steps:

1. (For Nvidia GPU users) Install the CUDA toolkit: https://developer.nvidia.com/cuda-downloads

2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin

In Windows Terminal with PowerShell:

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    mkdir build
    cd build
    cmake .. -DLLAMA_CUBLAS=ON
    cmake --build . --config Release
    cd bin/Release
    mkdir models
    mv Folder\Where\You\Downloaded\The\Model .\models
    .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null

`-DLLAMA_CUBLAS=ON` builds with CUDA support.

`2> $null` redirects the debug messages printed to stderr to a null file so they don't spam your terminal.

Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:

    function llama {
        .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p $args 2> $null
    }

Adjust your paths as necessary. It has a tendency to talk to itself.
jawerty almost 2 years ago
Some of you may have seen this, but I have a Llama 2 fine-tuning live coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.

In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU.

Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
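As a rough illustration of what that kind of setup looks like, here is a minimal sketch using the Hugging Face transformers/peft/bitsandbytes stack; the model name and hyperparameters are illustrative, not the exact ones from the stream:

    # Minimal QLoRA-style fine-tuning setup (sketch; training loop omitted).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-2-7b-hf"  # gated repo; requires approved access

    # Load the base model in 4-bit so it fits on a single GPU.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Train small LoRA adapters instead of updating all of the weights.
    lora_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of the parameters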
andreyk almost 2 years ago
This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android).

Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face inference (https://github.com/huggingface/text-generation-inference). And I am sure there are other things that could be covered.
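Once a text-generation-inference server is up, querying it from Python is a few lines; a sketch assuming the server is already serving a Llama 2 model locally (the URL and prompt are placeholders):

    # Query a running text-generation-inference server (sketch).
    from huggingface_hub import InferenceClient

    client = InferenceClient("http://127.0.0.1:8080")  # wherever TGI is listening
    reply = client.text_generation("Explain what a llama is in one sentence.", max_new_tokens=64)
    print(reply)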
krychu almost 2 years ago
Self-plug. Here's a fork of the original Llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:

https://github.com/krychu/llama

It runs with the original weights, and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
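The CPU/MPS fallback mentioned above comes down to a device check like this in PyTorch (a generic sketch, not the fork's actual code):

    # Prefer the Apple Silicon GPU (MPS) when present, otherwise fall back to CPU.
    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")  # M1/M2 GPU via Metal Performance Shaders
    else:
        device = torch.device("cpu")

    print(f"Using device: {device}")
    # model.to(device) before running inference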
thisisit almost 2 years ago
The easiest way I found was to use GPT4All. Just download and install, grab the GGML version of Llama 2, and copy it to the models directory in the installation folder. Fire up GPT4All and run.
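GPT4All also has Python bindings if you want to script it; a minimal sketch assuming the gpt4all package is installed and the model file has already been downloaded (the filename is a placeholder):

    # Minimal GPT4All scripting sketch; the model filename is a placeholder.
    from gpt4all import GPT4All

    model = GPT4All("llama-2-13b-chat.ggmlv3.q4_0.bin")  # looked up in the default models directory
    print(model.generate("Name three uses for a local LLM.", max_tokens=100))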
rootusrootus almost 2 years ago
For most people who just want to play around and are using macOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
Der_Einzige almost 2 years ago
The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui
aledalgrande almost 2 years ago
Don't remember if the grammar support has been merged into llama.cpp yet, but it would be the first step to having Llama + Stable Diffusion locally output text + images and talk to each other. The only part I'm not sure about is how Llama would interpret images back. At least it could use them, though, to build e.g. a webpage.
guy98238710 almost 2 years ago
> curl -L "https://replicate.fyi/install-llama-cpp" | bash

Seriously? Pipe a script from someone's website directly to bash?
jossclimb almost 2 years ago
Seems to be a better guide here (without the risky curl):

https://www.stacklok.com/post/exploring-llama-2-on-a-apple-mac-m1-m2
ericHosick almost 2 years ago
The LLM is impressive (llama2:13b) but appears to have been greatly limited in what you are allowed to do with it.

I tried to get it to generate a JSON object about the movie The Matrix and the model refuses.
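For anyone hitting the same wall, explicit formatting instructions in the prompt sometimes help; a sketch with llama-cpp-python (the model path is a placeholder, and whether the chat model complies still depends on its alignment tuning):

    # Nudge the chat model toward JSON-only output (sketch).
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin")
    prompt = (
        'Respond with JSON only, no commentary. Describe the movie "The Matrix" as '
        '{"title": ..., "year": ..., "director": ..., "genre": ...}\n'
    )
    out = llm(prompt, max_tokens=256)
    print(out["choices"][0]["text"])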
oaththrowaway almost 2 years ago
Off topic: is there a way to use one of the LLMs and have it ingest data from a SQLite database and ask it questions about it?
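One common pattern is not to ingest the whole database, but to hand the model the schema (and a few sample rows) and let it write SQL that you then execute; a sketch with the standard sqlite3 module (the database and table names are placeholders):

    # Give a local LLM the schema plus sample rows, ask it to produce SQL (sketch).
    import sqlite3

    conn = sqlite3.connect("mydata.db")  # placeholder database
    schema = "\n".join(row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table'") if row[0])
    sample = conn.execute("SELECT * FROM orders LIMIT 5").fetchall()  # placeholder table

    prompt = (
        "You are given this SQLite schema:\n" + schema +
        "\n\nSample rows from orders:\n" + repr(sample) +
        "\n\nQuestion: which customer placed the most orders?"
        " Reply with a single SQL query I can run."
    )
    # answer = llm(prompt)  # any local Llama 2 backend; run the returned SQL yourself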
maxlin almost 2 years ago
I might be missing something. The article asks me to run a bash script on Windows.

I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what is actually expected of a Windows user to make this run?

I'm currently paying $15 a month for a personal translation/summarizer project's ChatGPT queries. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.
nonethewiser almost 2 years ago
Maybe obvious to others, but the one-line install command with curl is taking a long time. Must be the build step. Probably 40+ minutes now on an M2 Max.
nravic almost 2 years ago
Self plug: run llama.cpp as an inference server on a spot instance anywhere: https://cedana.readthedocs.io/en/latest/examples.html#running-llama-cpp-inference
TheAceOfHearts almost 2 years ago
How do you decide which model variant to use? There's a bunch of quant method variations of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.

[0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
prohobo almost 2 years ago
The thing I get peeved by is that none of the models say how much RAM/VRAM they need to run. Just list minimum specs, please!
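In the absence of listed specs, a rough rule of thumb is parameter count times bits per weight, plus some headroom for the KV cache and scratch buffers; a sketch of that arithmetic (the numbers are approximations, not vendor specs):

    # Back-of-the-envelope RAM/VRAM estimate for a quantized GGML model (rough approximation).
    def approx_model_ram_gb(params_billion, bits_per_weight, overhead_gb=1.0):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9 + overhead_gb

    print(approx_model_ram_gb(13, 4.5))  # 13B at ~4.5 bits/weight (q4_0-ish) -> ~8.3 GB
    print(approx_model_ram_gb(7, 4.5))   # 7B -> ~4.9 GB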
sva_ almost 2 years ago
If you just want to do inference/mess around with the model and have a 16 GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.

[0] https://github.com/huggingface/blog/blob/main/llama2.md#using-transformers
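The linked section boils down to a few lines with the transformers pipeline API; a sketch assuming approved access to the gated meta-llama repo (sampling parameters are illustrative):

    # Llama 2 chat inference with transformers (sketch; gated repo access required).
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
        device_map="auto",
    )
    out = pipe("What are the three laws of robotics?", max_new_tokens=128,
               do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])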
handelaar almost 2 years ago
Idiot question: if I have access to sentence-by-sentence professionally translated foreign-language-to-English text in gigantic quantities, and I fed the originals as prompts and the translations as completions...

... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
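That is essentially supervised fine-tuning on parallel text, and the usual first step is just packaging the pairs as prompt/completion records; a sketch of the data prep (the pairs and file layout here are illustrative):

    # Turn aligned sentence pairs into a JSONL fine-tuning dataset (sketch).
    import json

    pairs = [
        ("Das ist ein Beispiel.", "This is an example."),  # illustrative data
        ("Wie spät ist es?", "What time is it?"),
    ]

    with open("translation_pairs.jsonl", "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            record = {"prompt": f"Translate to English: {src}\n", "completion": tgt}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")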
alvincodes almost 2 years ago
I appreciate their honesty, given that it's in their interest that people use their API rather than run it locally.
nomand almost 2 years ago
Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and using it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?
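There's no built-in long-term memory in the model itself, but you can persist the transcript yourself and replay it (or a summary of it) as context each session; a minimal sketch of that idea (the file format and truncation strategy are up to you):

    # Persist a chat transcript and rebuild the prompt from it each session (sketch).
    import json
    from pathlib import Path

    HISTORY = Path("chat_history.json")

    def load_history():
        return json.loads(HISTORY.read_text()) if HISTORY.exists() else []

    def build_prompt(history, user_msg):
        # Old turns eventually exceed the context window; summarize or truncate them.
        lines = [f"{turn['role']}: {turn['text']}" for turn in history]
        lines += [f"user: {user_msg}", "assistant:"]
        return "\n".join(lines)

    history = load_history()
    prompt = build_prompt(history, "Where did we leave off yesterday?")
    # reply = llm(prompt)  # whatever local backend you use
    # history += [{"role": "user", "text": "..."}, {"role": "assistant", "text": reply}]
    # HISTORY.write_text(json.dumps(history, indent=2))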
synaesthesisx almost 2 years ago
This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens&#x2F;s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
boffinAudio almost 2 years ago
I need some hand-holding... I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material - is this even feasible?
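Training on the PDFs directly is usually not the practical route; the common pattern is retrieval-augmented generation: extract and chunk the text, embed the chunks, then pull the relevant ones into the prompt per question. A sketch under assumed libraries (pypdf and sentence-transformers; at 80,000 documents you would want a real vector database rather than the in-memory search shown here):

    # Index PDFs, retrieve relevant chunks, and ask a local LLM about them (sketch).
    from pathlib import Path
    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Extract and chunk text (one chunk per page here, for simplicity).
    chunks = []
    for pdf in Path("pdfs/").glob("*.pdf"):
        for page in PdfReader(pdf).pages:
            text = page.extract_text() or ""
            if text.strip():
                chunks.append(text)

    # 2. Embed the chunks once, then retrieve the best matches per question.
    corpus_emb = embedder.encode(chunks, convert_to_tensor=True)
    question = "What do these documents say about warranty terms?"
    hits = util.semantic_search(
        embedder.encode(question, convert_to_tensor=True), corpus_emb, top_k=3)[0]

    context = "\n---\n".join(chunks[h["corpus_id"]] for h in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # answer = llm(prompt)  # any local Llama 2 backend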
RicoElectrico almost 2 years ago

    curl -L "https://replicate.fyi/windows-install-llama-cpp"

... returns 404 Not Found
theLiminator almost 2 years ago
Is it possible to do hybrid inference if I have a 24 GB card with the 70B model? I.e., offload some of it to my RAM?
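llama.cpp supports exactly this kind of split via its GPU-layer setting; a sketch with the llama-cpp-python binding, assuming a GPU-enabled build (the model path and layer count are placeholders, and 70B models may need additional options depending on the version):

    # Partial GPU offload with llama-cpp-python (sketch).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-70b-chat.ggmlv3.q4_0.bin",  # placeholder path
        n_gpu_layers=40,  # as many layers as fit in the 24 GB card; the rest stay in system RAM
    )
    print(llm("Hello from a hybrid CPU/GPU setup.", max_tokens=32)["choices"][0]["text"])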
amelius almost 2 years ago
As someone with too little spare time I'm curious, what are people using this for, except research?
technological almost 2 years ago
Did anyone build a PC for running these models, and which one do you recommend?
TastyAmphibian almost 2 years ago
I'm still curious about the hype behind Llama 2.
politelemon almost 2 years ago
Llama.cpp can run on Android too.