Run Llama 13B with a 6GB graphics card

618 points by rain1, about 2 years ago

28 comments

rahimnathwani about 2 years ago
On my system, using `-ngl 22` (running 22 layers on the GPU) cuts wall clock time by ~60%.

My system:

GPU: NVidia RTX 2070S (8GB VRAM)

CPU: AMD Ryzen 5 3600 (16GB RAM)

Here's the performance difference I see:

CPU only (./main -t 12)

    llama_print_timings:        load time = 15459.43 ms
    llama_print_timings:      sample time =    23.64 ms /  38 runs   (  0.62 ms per token)
    llama_print_timings: prompt eval time =  9338.10 ms / 356 tokens ( 26.23 ms per token)
    llama_print_timings:        eval time = 31700.73 ms /  37 runs   (856.78 ms per token)
    llama_print_timings:       total time = 47192.68 ms

GPU (./main -t 12 -ngl 22)

    llama_print_timings:        load time = 10285.15 ms
    llama_print_timings:      sample time =    21.60 ms /  35 runs   (  0.62 ms per token)
    llama_print_timings: prompt eval time =  3889.65 ms / 356 tokens ( 10.93 ms per token)
    llama_print_timings:        eval time =  8126.90 ms /  34 runs   (239.03 ms per token)
    llama_print_timings:       total time = 18441.22 ms
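For a quick sanity check, the speedup implied by those logs can be worked out directly from the printed numbers (a back-of-envelope calculation, nothing more):

    # Numbers taken from the llama_print_timings output above.
    cpu_eval_ms_per_token = 856.78
    gpu_eval_ms_per_token = 239.03
    cpu_total_ms, gpu_total_ms = 47192.68, 18441.22

    print(f"eval speedup: {cpu_eval_ms_per_token / gpu_eval_ms_per_token:.1f}x")   # ~3.6x
    print(f"CPU: {1000 / cpu_eval_ms_per_token:.2f} tok/s, "
          f"GPU offload: {1000 / gpu_eval_ms_per_token:.2f} tok/s")                # ~1.17 vs ~4.18
    print(f"wall-clock reduction: {1 - gpu_total_ms / cpu_total_ms:.0%}")          # ~61%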
naillo about 2 years ago
This is cool, but are people actually getting stuff done with these models? I'm enthusiastic about their potential too, but after playing with one for a day I'm at a loss for what to use it for at this point.
holoduke about 2 years ago
Why don't AMD or Intel release a medium-performance GPU with a minimum of 128GB of memory at a good consumer price? These models require lots of memory to 'single-pass' an operation; throughput could be a bit slower. A 1080-class Nvidia card with 256GB of memory would run all these models fast, right? Or am I forgetting something here?
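One way to see why memory, rather than throughput, is the sticking point is a rough sizing of the weights alone. The 20% overhead factor and the ~4.5 effective bits per weight for 4-bit GGML quantization below are illustrative assumptions, not exact figures:

    def approx_mem_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
        """Weights only: params * bits / 8, plus a rough allowance for KV cache and activations."""
        return params_billion * 1e9 * bits_per_weight / 8 / 2**30 * overhead

    for name, n in [("7B", 7), ("13B", 13), ("65B", 65)]:
        print(f"{name}: fp16 ~{approx_mem_gb(n, 16):.0f} GB, 4-bit ~{approx_mem_gb(n, 4.5):.0f} GB")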
peatmoss about 2 years ago
From skimming, it looks like this approach requires CUDA and thus is Nvidia-only.

Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4-bit quantization is the special sauce for CUDA, but I'd guess there'd be something comparable for not-CUDA?
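For intuition about what the 4-bit quantization itself involves, here is a rough numpy sketch of symmetric block quantization. It is in the spirit of GGML's Q4 formats but is not the exact on-disk layout; the block size and rounding scheme are illustrative assumptions:

    import numpy as np

    def quantize_4bit(weights: np.ndarray, block_size: int = 32):
        """Quantize each block of weights to 4-bit integers plus one float scale per block."""
        w = weights.reshape(-1, block_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12  # map values into [-7, 7]
        q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return (q * scale).reshape(-1)

    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_4bit(w)
    print("mean abs error:", float(np.abs(dequantize(q, s) - w).mean()))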
marcopicentini about 2 years ago
What do you use to host these models (like Vicuna, Dolly, etc.) on your own server and expose them via an HTTP REST API? Is there a Heroku-like service for LLMs?

I am looking for an open-source model to do text summarization. OpenAI is too expensive for my use case because I need to pass lots of tokens.
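One low-effort option is to wrap llama-cpp-python in a small FastAPI app and run it behind uvicorn. A minimal sketch follows; the model path, prompt wording, and stop sequence are placeholders rather than anything prescribed by those libraries:

    from fastapi import FastAPI
    from pydantic import BaseModel
    from llama_cpp import Llama

    app = FastAPI()
    # Placeholder model path; any GGML checkpoint that fits in RAM will do.
    llm = Llama(model_path="models/ggml-vic7b-q4_0.bin", n_ctx=2048)

    class SummarizeRequest(BaseModel):
        text: str
        max_tokens: int = 256

    @app.post("/summarize")
    def summarize(req: SummarizeRequest):
        prompt = f"Summarize the following text.\n\n{req.text}\n\nSummary:"
        out = llm(prompt, max_tokens=req.max_tokens, stop=["\n\n"])
        return {"summary": out["choices"][0]["text"].strip()}

    # Run with: uvicorn server:app --host 0.0.0.0 --port 8000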
syntaxing about 2 years ago
This update is pretty exciting; I'm going to try running a large model (65B) with a 3090. I have run a ton of local LLMs, but the hardest part is finding out the prompt structure. I wish there were some sort of centralized database that explains it.
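As one concrete example, the template commonly used for Vicuna v1.1-style checkpoints looks roughly like the sketch below; treat the exact wording as an assumption and check the model card, since other fine-tunes (Alpaca, WizardLM, etc.) expect different markers:

    # Roughly the Vicuna v1.1 conversation format.
    SYSTEM = ("A chat between a curious user and an artificial intelligence assistant. "
              "The assistant gives helpful, detailed, and polite answers to the user's questions.")

    def vicuna_prompt(user_message: str) -> str:
        return f"{SYSTEM} USER: {user_message} ASSISTANT:"

    print(vicuna_prompt("Explain what the -ngl flag does in llama.cpp."))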
tikkun about 2 years ago
See also:

https://www.reddit.com/r/LocalLLaMA/comments/13fnyah/you_guys_are_missing_out_on_gpt4x_vicuna/

https://chat.lmsys.org/?arena (click 'Leaderboard')
Ambix about 2 years ago
No need to convert models; 4-bit LLaMA versions for GGML v2 are available here:

https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main
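If you'd rather script the download than click through the repo page, the huggingface_hub library can fetch individual files. The filename below is a placeholder; substitute one that actually appears in the repo listing:

    from huggingface_hub import hf_hub_download

    # NOTE: the filename is hypothetical; pick a real one from the repo's file list.
    path = hf_hub_download(
        repo_id="gotzmann/LLaMA-GGML-v2",
        filename="llama-13b-ggml-v2-q4_0.bin",
    )
    print("downloaded to:", path)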
mozillas about 2 years ago
I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air (8GB RAM) with llama.cpp.

It worked OK for me with the default context size. 2048, like you see in most examples, was too slow for my taste.
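The equivalent knob in llama-cpp-python is n_ctx; a minimal sketch of loading the same checkpoint with a smaller context window (the prompt wording is just an example):

    from llama_cpp import Llama

    # A smaller context window cuts memory use and prompt-processing time on an 8GB machine.
    llm = Llama(model_path="ggml-vic7b-q4_0.bin", n_ctx=512)  # examples often default to 2048
    out = llm("USER: Give me three breakfast ideas. ASSISTANT:", max_tokens=128)
    print(out["choices"][0]["text"])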
yawnxyz about 2 years ago
Could someone please share a good resource for building a machine from scratch, for doing simple-ish training and running open-source models like Llama? I'd love to run some of these and even train them from scratch, and I'd love to use that as an excuse to drop $5k on a new machine...

Would love to run a bunch of models on the machine without dripping $$ to OpenAI, Modal or other providers...
rahimnathwani about 2 years ago
PSA:

If you're using oobabooga/text-generation-webui then you need to:

1. Re-install llama-cpp-python with support for CUBLAS:

    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall

2. Launch the web UI with the --n-gpu-layers flag, e.g.

    python server.py --model gpt4-x-vicuna-13B.ggml.q5_1.bin --n-gpu-layers 24
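For completeness, the same setting maps to the n_gpu_layers argument if you call llama-cpp-python directly rather than through the web UI; it only has an effect when the package was built with cuBLAS as in step 1. A minimal sketch (thread count and prompt are assumptions):

    from llama_cpp import Llama

    # Offload 24 transformer layers to the GPU; the remaining layers run on the CPU.
    llm = Llama(
        model_path="gpt4-x-vicuna-13B.ggml.q5_1.bin",
        n_gpu_layers=24,
        n_threads=12,  # tune to your CPU core count
    )
    out = llm("USER: Hello! ASSISTANT:", max_tokens=32)
    print(out["choices"][0]["text"])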
sroussey about 2 years ago
I wish this used the WebGPU C++ library instead; then it could be used on any GPU hardware.
hhh about 2 years ago
The instructions are a bit rough. The Micromamba thing doesn't work and doesn't say how to install it... you have to clone llama.cpp too.
tarr11 about 2 years ago
What is the state of the art in evaluating the accuracy of these models? Is there some equivalent to an "end-to-end test"?

It feels somewhat recursive, since the input and output are natural language, so you would need another LLM to evaluate whether the model answered a prompt correctly.
bitL about 2 years ago
How about reloading parts of the model as the inference progresses, instead of splitting it into GPU/CPU parts? Reloading would be memory-limited to the largest intermediate tensor cut.
akulbe about 2 years ago
I've only ever been a consumer of ChatGPT/Bard. I've never set up any LLM stuff locally, but the idea appeals to me.

I have a ThinkStation P620 w/ ThreadRipper Pro 3945WX (12c24t) with a GTX 1070 (and a second 1070 I could put in there), and there's 512GB of RAM in the box.

Does this need to be bare metal, or can it run in a VM?

I'm currently running RHEL 9.2 w/ KVM (as a VM host) with light usage so far.
qwertox about 2 years ago
If I really want to do some playing around in this area, would it be better to get an RTX 4000 SFF, which has 20 GB of VRAM but is a low-power card (which I want, since it would be running 24/7 and energy prices are pretty bad in Germany), or would it make more sense to buy an Apple product with an M2 chip, which apparently is good for these tasks because it shares CPU and GPU memory?
ranger_danger about 2 years ago
Why can't these models run on the GPU while also using CPU RAM for storage? That way people with performant-but-memory-starved GPUs could still utilize the better performance of GPU calculation while also having enough RAM to store the model. I know it is possible to provide system-RAM-backed GPU objects.
anshumankmr about 2 years ago
How long before it runs on a 4 GB card?
MuffinFlavored about 2 years ago
How many "B" (billions of parameters) is ChatGPT GPT-4?
BlackLotus89 about 2 years ago
This only uses llama, correct? So the output should be the same as if you were only using llama.cpp. Am I the only one who doesn't get nearly the same quality of output from a quantized model compared to GPU? Some models I tried get astounding results when running on a GPU, but produce only "garbage" when running on a CPU. Even when not quantized down to 4-bit, llama.cpp just doesn't compare for me. Am I alone in this?
dclowd9901 about 2 years ago
Has anyone tried running encryption algorithms through these models? I wonder if one could be trained to decrypt.
dinobones about 2 years ago
What is HN's fascination with these toy models that produce low-quality, completely unusable output?

Is there a use case for them I'm missing?

Additionally, don't they all have fairly restrictive licenses?
blendergeek about 2 years ago
Is there a way to run any of these with only 4GB of VRAM?
alg_fun about 2 years ago
Wouldn't it be faster to use RAM as swap for VRAM?
avereveard about 2 years ago
Or just download oobabooga/text-generation-webui and any prequantized variant, and be done.
s_dev about 2 years ago
[deleted]
ACV001 about 2 years ago
The future is this: these models will be able to run on smaller and smaller hardware, eventually running on your phone, watch, or embedded devices. The revolution is here and is inevitable, similar to how computers evolved. We are still lucky that these models have no consciousness, yet. Once they gain consciousness, that will mark the appearance of a new species (superior to us, if anything). Also, luckily, they have no physical bodies and cannot replicate, so far...