In my tests LLaMA2-13B is usable for information extraction (IE) tasks, and LLaMA2-70B is almost as good as GPT-4 for IE. These models are the real thing. We can fine-tune LLaMAs, unlike OpenAI's models. Now we can have privacy, control and lower prices. We can introduce guidance, KV caching and other tricks to improve the models.<p>The enthusiasm around it reminds me of the JavaScript framework wars of 10 years ago - tons of people innovating and debating approaches, lots of projects popping up, so much energy!
I've been evaluating running non-quantized models on Google Cloud instances with various GPUs.<p>To run a `vllm`-backed Llama 2 7B model[1], start a Debian 11 <i>spot</i> instance with one Nvidia L4 (g2-standard-8) and 100GB of SSD disk (ignoring the advice to use a CUDA installer image):<p><pre><code> sudo apt-get update -y
sudo apt-get install build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
sudo apt-get install python3-pip -y
sudo pip install --upgrade huggingface_hub
# skip using token as git credential
huggingface-cli login  # for Meta model access, paste a token from HF[2]
sudo pip install vllm # ~8 minutes
</code></pre>
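Before the test script, it's worth a quick sanity check that the GPU is visible from Python (torch comes in as a vllm dependency); a minimal check, assuming the installs above finished cleanly:<p><pre><code> import torch

# should print True and the L4 device name if the driver install worked
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
</code></pre>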
Then, create a test script for the 7B Llama 2 model (paste it into llama.py and run it with python3 llama.py):<p><pre><code> from vllm import LLM
llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("The capital of Brazil is called")
print(output)
</code></pre>
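`generate` returns a list of request outputs rather than plain text; to print just the completions and control sampling, the script can be extended roughly like this (a sketch, sampling values are arbitrary):<p><pre><code> from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)  # arbitrary settings

outputs = llm.generate(["The capital of Brazil is called"], params)
for out in outputs:
    # each result carries the prompt plus one or more completions
    print(out.prompt, out.outputs[0].text)
</code></pre>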
Spot price for this deployment is ~$225/month. The instance will eventually be terminated by Google, so plan accordingly.<p>[1] <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html" rel="nofollow noreferrer">https://vllm.readthedocs.io/en/latest/models/supported_model...</a>
[2] <a href="https://huggingface.co/settings/tokens" rel="nofollow noreferrer">https://huggingface.co/settings/tokens</a>
Did anybody try the Llama 2 model with languages other than English? The paper notes that it works best with English, and the amount of training data for other languages is only a small fraction of the total - which would likely make it unusable for me.<p>See table 10 (page 22) of the paper for the numbers:
<a href="https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/" rel="nofollow noreferrer">https://ai.meta.com/research/publications/llama-2-open-found...</a><p>Are there other downloadable models which can be used in a multilingual environment that people here are aware of?
If you're looking to run Llama 2 locally via a CLI or REST API (vs the web UI this article highlights), there's an open-source project some folks and I have been working on over the last few weeks: <a href="https://github.com/jmorganca/ollama">https://github.com/jmorganca/ollama</a> (a quick sketch of the REST API is below)<p>More projects in this space:<p>- llama.cpp, which is a fast, low-level runner (with bindings in several languages)<p>- llm by Simon Willison, which supports different backends and has a really elegant CLI interface<p>- The MLC.ai and Apache TVM projects<p>Previous discussion on HN that might be helpful, from an article by the great folks at Replicate: <a href="https://news.ycombinator.com/item?id=36865495">https://news.ycombinator.com/item?id=36865495</a>
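A minimal version of that REST call from Python (a sketch; it assumes the ollama server is running on its default port and the llama2 model has been pulled - responses are streamed as newline-delimited JSON):<p><pre><code> import json, requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
)
# collect the streamed "response" chunks into one string
text = "".join(json.loads(line).get("response", "") for line in resp.iter_lines() if line)
print(text)
</code></pre>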
If you want to try Llama 2 on a Mac and have Homebrew (or Python/pip) you may find my LLM CLI tool interesting: <a href="https://simonwillison.net/2023/Aug/1/llama-2-mac/" rel="nofollow noreferrer">https://simonwillison.net/2023/Aug/1/llama-2-mac/</a>
If you’re someone who wants to fine-tune Llama 2 on Google Colab, I have a couple of live coding streams from this past week where I fine-tune Llama on my own dataset.<p>Here’s the stream - <a href="https://www.youtube.com/live/LitybCiLhSc?feature=share">https://www.youtube.com/live/LitybCiLhSc?feature=share</a><p>One is with LoRA and the other with QLoRA, and I also do a breakdown of each fine-tuning method. I wanted to make these since I myself have had issues running LLMs locally and Colab is the cheapest GPU I can find haha.
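If you just want the gist before watching: with the `peft` library, the core of a LoRA setup is only a few lines. A rough sketch (the model name, rank and target modules here are illustrative, not the exact values from the streams):<p><pre><code> from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM checkpoint works
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA: freeze the base weights and train small low-rank adapter matrices
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the 7B weights
# ...then train with the usual transformers Trainer on your dataset
</code></pre>
QLoRA follows the same pattern, except the frozen base model is loaded in 4-bit (via bitsandbytes), which is what makes it fit on a cheap Colab GPU.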
So I tried getting LongChat (a 32k-context Llama 2 7B model released a few days ago) running with FastChat, and I got it working. But for what I was trying to use it for (a LangChain SQL agent), it is not good enough out of the box. Part of this is that I think LangChain is somewhat biased towards OpenAI’s models, and perhaps LlamaIndex would perform better. However, LlamaIndex uses a newer version of SQLAlchemy that a bunch of data warehouse clients don’t support yet.<p>Unfortunately, with all of the hype, it seems that unless you have a REALLY beefy machine, the better 70B model is out of reach for most people to run locally, leaving the 7B and 13B as the only viable options outside of some quantization trickery. Or am I wrong about that?<p>I want to focus more on larger context windows, since RAG seems to have a lot of promise, so the 7B with a giant context window seems like a better path to explore than getting the 70B to work locally.
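For anyone wanting to try the FastChat + LangChain SQL agent combination mentioned above: FastChat can expose an OpenAI-compatible server, and LangChain's OpenAI wrappers can point at it. A rough sketch (import paths move around between LangChain versions; the endpoint, model name and database URI are placeholders):<p><pre><code> from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain.agents.agent_toolkits import SQLDatabaseToolkit, create_sql_agent

# FastChat's OpenAI-compatible server (fastchat.serve.openai_api_server),
# with a model worker serving the LongChat weights behind it
llm = ChatOpenAI(
    openai_api_base="http://localhost:8000/v1",  # placeholder endpoint
    openai_api_key="EMPTY",
    model_name="longchat-7b-32k",                # placeholder model name
)

db = SQLDatabase.from_uri("sqlite:///warehouse.db")  # placeholder database
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)
agent.run("How many orders were placed last month?")
</code></pre>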
Just set things up locally last night. If you're a developer, llama.cpp was a pleasure to build and run. I wanted to run the weights from Meta and couldn't figure out text-generation-webui; that one seems optimized for grabbing something off Hugging Face.<p>Running on a 3090. The 13B chat model quantized to 8-bit is giving about 42 tok/s.
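If you'd rather drive it from Python than the CLI, the llama-cpp-python bindings wrap the same runtime; a sketch, assuming you've already converted and quantized the Meta weights with llama.cpp's scripts (the model path and layer count are placeholders):<p><pre><code> from llama_cpp import Llama

# placeholder path to a llama.cpp-quantized 13B chat model;
# n_gpu_layers offloads layers to the 3090
llm = Llama(
    model_path="./models/llama-2-13b-chat.q8_0.bin",
    n_gpu_layers=40,
    n_ctx=2048,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
</code></pre>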
What's the cheapest way to run e.g. LLaMa2-13B and have it served as an API?<p>I've tried Inference Endpoints and Replicate, but both would cost more than just using the OpenAI offering.
I am partial to Koboldcpp over text gen UI for a number of reasons.<p>...But I am also a bit out of the loop. For instance, I have not kept up with the CFG/negative prompt or grammar implementations in the UIs.
I've only used the 13B model and I'd say it was as good as GPT-3 (not GPT-4). It's amazing, and I only have a laptop to run it on locally, so 13B is as good as I can do.
One way to connect Llama 2 (via llama.cpp) to a Node.js app is with this helper class that talks to it over stdin: <a href="https://gist.github.com/HackyDev/814c6d1c96f259a13dbf5b2dabf98e8f" rel="nofollow noreferrer">https://gist.github.com/HackyDev/814c6d1c96f259a13dbf5b2dabf...</a>