
Cost of self hosting Llama-3 8B-Instruct

245 points by veryrealsid 11 months ago

39 comments

philipkglass · 11 months ago

> Instead of using AWS another approach involves self hosting the hardware as well. Even after factoring in energy, this does dramatically lower the price.
>
> Assuming we want to mirror our setup in AWS, we'd need 4x NVidia Tesla T4s. You can buy them for about $700 on eBay. Add in $1,000 to setup the rest of the rig and you have a final price of around: $2,800 + $1,000 = $3,800

This whole exercise assumes that you're using the Llama 3 8B model. At full fp16 precision that will fit in one 3090 or 4090 GPU (the int8 version will too, and run faster, with very little degradation). Especially if you're willing to buy GPU hardware from eBay, that will cost significantly less.

I have my home workstation with a 4090 exposed as a vLLM service to an AWS environment, where I access it via a reverse SSH tunnel.

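For readers curious what such a setup looks like in practice, here is a minimal sketch of querying a home vLLM server from a cloud machine over a reverse SSH tunnel. The port, host, and model name are assumptions for illustration, not the commenter's actual configuration; vLLM exposes an OpenAI-compatible endpoint, so the standard client works.

```python
# Sketch: querying a home vLLM server from a cloud box over a reverse SSH tunnel.
# Assumes the home machine ran something like:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
#   ssh -N -R 8000:localhost:8000 user@cloud-host   # hypothetical user/host
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the tunneled port on the cloud box
    api_key="not-needed",                 # vLLM ignores the key unless configured otherwise
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of self-hosting an 8B model."}],
)
print(resp.choices[0].message.content)
```
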
throwaway2016a · 11 months ago

Llama-3 is one of the models provided by AWS Bedrock, which offers pay-as-you-go pricing. I'm curious how it would break down on that.

Llama 3 8B on Bedrock is $0.40 per 1M input tokens and $0.60 per 1M output tokens, which is a lot cheaper than OpenAI models.

Edit: to add to that, as technical people we tend to discount the value of our own time. Bedrock and the OpenAI API are both very easy to integrate with and get started on. How long did this server take to build? How much time does it take to maintain and make sure all the security patches are applied each month? How often does it crash, and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail? That's got to be part of the break-even equation.

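A quick back-of-the-envelope comparison using the Bedrock prices quoted above and the article's $3,800 hardware figure; the monthly token volumes and the $50/month electricity figure are assumptions for illustration.

```python
# Rough break-even sketch using the Bedrock prices quoted above.
# The monthly token volumes are made-up assumptions for illustration.
BEDROCK_INPUT_PER_M = 0.40   # $ per 1M input tokens (quoted above)
BEDROCK_OUTPUT_PER_M = 0.60  # $ per 1M output tokens (quoted above)
HARDWARE_COST = 3_800        # the article's self-hosting estimate
POWER_PER_MONTH = 50         # assumed electricity cost, $/month

input_m, output_m = 20, 5    # assumed monthly volume: 20M tokens in, 5M out

bedrock_monthly = input_m * BEDROCK_INPUT_PER_M + output_m * BEDROCK_OUTPUT_PER_M
print(f"Bedrock: ${bedrock_monthly:.2f}/month")

# Months until the self-hosted box pays for itself (ignoring your own time):
if bedrock_monthly > POWER_PER_MONTH:
    months = HARDWARE_COST / (bedrock_monthly - POWER_PER_MONTH)
    print(f"Hardware pays for itself after ~{months:.0f} months")
else:
    print("At this volume the API is cheaper than the electricity alone")
```
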
johnklos · 11 months ago

Self hosting means hosting it yourself, not running it on Amazon. I think the distinction the author intends to make is between running something that can't be hosted elsewhere, like ChatGPT, versus running Llama-3 yourself.

Overlooking that, the rest of the article feels a bit strange. Would we really have a use case where we can make use of those 157 million tokens a month? Would we really round $50 of energy cost to $100 a month? (Granted, the author didn't include power for the computer.) If we buy our own system to run, why would we need to "scale your own hardware"?

I get that this is just to give us an idea of what running something yourself would cost when comparing with services like ChatGPT, but if so, we wouldn't be making most of the choices made here, such as getting four NVIDIA Tesla T4 cards.

Memory is cheap, so running Llama-3 entirely on CPU is also an option. It's slower, of course, but it's infinitely more flexible. If I really wanted to spend a lot of time tinkering with LLMs, I'd definitely do this to figure out what I want to run before deciding on GPU hardware, then get GPU hardware that best matches that, instead of the other way around.

kiratp · 11 months ago

3-year commit pricing with JetStream + MaxText on TPU v5e is $0.25 per million tokens. On-demand pricing puts it at about $0.45 per million tokens.

Source: We use TPUs at scale at https://osmos.io

Google Next 2024 session going into detail: https://www.youtube.com/watch?v=5QsM1K9ahtw

https://github.com/google/JetStream

https://github.com/google/maxtext

gradus_ad · 11 months ago

I wonder how long NVIDIA can justify its current market cap once people realize just how cheap it is to run inference on these models, given that LLM performance is plateauing, LLMs as a whole are becoming commoditized, and compute demand for training will drop off a cliff sooner than people expect.

angoragoats · 11 months ago

Agreed with the sentiments here that this article gets a lot of the facts wrong, and I'll add one: the cost of electricity when self-hosting is dramatically lower than the article says. The math assumes that each of the Tesla T4s will be using its full TDP (70W each) 24 hours a day, 7 days a week. In reality, GPUs throttle down to a low power state when not in use. So unless you're conversing with your LLM literally 24 hours a day, it will use dramatically less power. Even when actively doing inference, my GPU doesn't quite max out its power usage.

Your self-hosted LLM box is going to use maybe 20-30% of the power this article suggests it will.

Source: I run LLMs at home on a machine I built myself.

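A sketch of the electricity arithmetic behind this point. The 4x 70W TDP figure and the 24c/kWh rate come from the discussion of the article; the idle draw, duty cycle, and load fraction are assumptions for illustration.

```python
# Electricity math: article's always-at-TDP assumption vs. a throttled-GPU estimate.
TDP_W = 70          # Tesla T4 TDP
N_GPUS = 4
RATE = 0.24         # $/kWh, the rate the article reportedly uses
HOURS_MONTH = 24 * 30

# Article's assumption: all four cards pinned at TDP around the clock.
always_on_kwh = N_GPUS * TDP_W / 1000 * HOURS_MONTH
print(f"Always at TDP: {always_on_kwh:.0f} kWh -> ${always_on_kwh * RATE:.2f}/month")

# More realistic (assumed): ~10W idle per card, busy ~15% of the time at ~80% of TDP.
IDLE_W, DUTY, LOAD_FRAC = 10, 0.15, 0.8
realistic_w = N_GPUS * (IDLE_W * (1 - DUTY) + TDP_W * LOAD_FRAC * DUTY)
realistic_kwh = realistic_w / 1000 * HOURS_MONTH
print(f"Throttled estimate: {realistic_kwh:.0f} kWh -> ${realistic_kwh * RATE:.2f}/month")
```
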
wesleyyue · 11 months ago

Surprised no comments are pointing out that the analysis is pretty far off simply because the author runs with a batch size of 1. The cost coming out at 100x-1000x what API providers are charging should be a hint that something is seriously off, even if you expect some of these APIs to be subsidized.

causal · 11 months ago

No way you need $3,800 to run an 8B model. A 3090 and a basic rig is enough.

That being said, the difference between OpenAI and AWS cost ($1 vs $17) is huge. Is OpenAI just operating at a massive loss?

Edit: Turns out AWS is actually cheaper if you don't use the terrible setup in this article, see comments below.

forrest2 · 11 months ago

A single synchronous request is not a good way to understand cost here unless your workload is truly singular tiny requests. ChatGPT handles many requests in parallel, and this article's 4-GPU setup certainly can handle more too.

It is miraculous that the cost comparison isn't worse given how adversarial this test is.

Larger requests, concurrent requests, and request queueing will drastically reduce cost here.

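To make the batching point concrete, here is a minimal sketch of offline batched inference with vLLM, which schedules many prompts concurrently via continuous batching; the model name and prompts are placeholders.

```python
# Batched inference with vLLM: many prompts submitted at once are batched
# across the GPU(s), so cost per token drops sharply compared with one
# synchronous request at a time.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules all of these concurrently with continuous batching.
prompts = [f"Write a one-line summary of topic #{i}." for i in range(64)]
outputs = llm.generate(prompts, params)

for out in outputs[:3]:
    print(out.outputs[0].text.strip())
```
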
liquidise · 11 months ago

Great mix of napkin math and proper analysis, but what strikes me most is how cheap LLM access is. For something relatively bleeding-edge, splitting hairs over < $20/M tokens is remarkable in itself, and something tech people should be thrilled about.

throwup238 · 11 months ago

The T4 is a six-year-old card. A much better comparison would be a 3090, 4090, A10, A100, etc.

Havoc · 11 months ago

> initial server cost of $3,800

Not following?

Llama 8B is like 17-ish gigs. You can throw that onto a single 3090 off eBay: $700 for the card and another $500 for a second-hand basic gaming rig.

Plus you don't need a 4-slot PCIe mobo. Plus it's a gen4 PCIe card (vs gen3). Plus you skip the complexity of multi-GPU. And I wouldn't be surprised if it ends up faster too (everything in one GPU tends to be much faster in my experience, plus the 3090 is just organically faster 1:1).

Or if you're feeling extra spicy you can do the same on a 7900 XTX (inference works fine on those, and it's likely that there will be big optimization gains in the coming months).

AaronFriel · 11 months ago

These costs don't line up with my own experiments using vLLM on EKS for hosting small to medium sized models. For small (under 10B parameter) models on g5 instances, with prefix caching and an agent-style workload with only one or a small number of turns per request, I saw on the order of tens of thousands of tokens/second of prefill (due to my common system prompts) and around 900 tokens/second of output.

I think this worked out to around $1/million tokens of output, and orders of magnitude less for input tokens, before reserved instances or other providers were considered.

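The conversion from sustained throughput to cost per million tokens is straightforward; a sketch follows. The hourly instance price is a placeholder (roughly a single-GPU g5 on-demand rate), so the exact figure depends on the instance chosen and how well utilization is sustained.

```python
# Converting sustained throughput into $/million output tokens.
# The instance price is an assumption; plug in your own.
INSTANCE_PER_HOUR = 1.0     # assumed $/hour for a single-GPU g5-class instance
OUTPUT_TOK_PER_SEC = 900    # sustained output throughput reported above

tokens_per_hour = OUTPUT_TOK_PER_SEC * 3600
cost_per_million_output = INSTANCE_PER_HOUR / tokens_per_hour * 1_000_000
print(f"~${cost_per_million_output:.2f} per 1M output tokens")  # ~$0.31 at these placeholder numbers
```
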
xmonkee · 11 months ago

Does anyone know the impact of prompt size on throughput? If I'm only generating 10 tokens, does it matter if my initial prompt is 10 tokens or 8,000 tokens? How much does it matter?

visarga · 11 months ago

I just bought a $1,099 MacBook Air M3, and I get about 10 tokens/s for a q5 quant. It doesn't even get hot, and I can take it with me on the plane. It's really easy to install ollama.

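For anyone who wants to script against such a local setup, here is a minimal sketch of calling a local Ollama server from Python; it assumes the llama3 model has been pulled and Ollama is listening on its default port, and the prompt is just an example.

```python
# Minimal call to a local Ollama server from Python.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={"model": "llama3", "prompt": "Explain KV caching in two sentences.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```
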
mark_l_watson · 11 months ago

Until January this year I mostly used Google Colab for both LLM and deep learning projects. In January I spent about $1,800 on an Apple Silicon M2 Pro with 32GB. When I first got it, I was only so-so happy with the models I could run. Now I am ecstatically happy with the quality of the models I can run on this hardware.

I sometimes use Groq Llama3 APIs (so fast!) or OpenAI APIs, but I mostly use my 32GB M2 system.

The article calculates the cost of self-hosting, but I think it is also worth taking into account how happy I am self-hosting on my own hardware.

segmondy · 11 months ago

I own an 8-GPU cluster that I built for super cheap, under $4,000: 180GB of VRAM, 7x 24GB + 1x 24GB. There are tons of models I run that aren't hosted by any provider; the only way to run them is to host them myself. Furthermore, the author gets 39 tokens in 6 seconds. For llama3-8b I get almost 80 tk/s, and running requests in parallel I can easily get up to 800 tk/s. Most users at home infer one request at a time because they are doing chat or role play. If you are doing more serious work, you will most likely have multiple inferences running at once. When working with smaller models, it's not unusual to have 4-5 models loaded at once with multiple inferences going. I have about 2TB of models downloaded, I don't have to shuffle data back and forth to the cloud, etc. To each their own; the author's argument is made today by many on why you should host in the cloud. Yet if you are not flush with cash and a little creative, it's far cheaper to run your own server than to rent in the cloud.

To run llama-3 8b, a new $300 3060 12GB will do; it will load fine in Q8 GGUF. If you must load in fp16 and cash is a problem, a $160 P40 will do. If performance is desired, a used 3090 for ~$650 will do.

rfw300 · 11 months ago

I agree with most of the criticisms here, and will add on one more: while it is generally true that you can’t beat “serverless” inference pricing for LLMs, production deployments often depend on fine-tuned models, for which these providers typically charge much more to host. That’s where the cost (and security, etc.) advantage for running on dedicated hardware comes in.

theogravity · 11 months ago

Energy costs in the Bay Area are double the reported 24c rate, so energy alone would be around $100-ish a month instead of $50-ish.

jezzarax · 11 months ago

llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine. I can't remember the TPS I got there, but it was much above the 6 mentioned in the article.

yousif_123123 · 11 months ago

deepinfra.com hosts Llama 3 8B for 8 cents per 1M tokens. I'm not sure it's the cheapest, but it's pretty cheap. There may be even cheaper options.

(I haven't used it in production; I'm thinking of using it for side projects.)

winddude · 11 months ago

Does AWS not have lower-vCPU and lower-memory instances with multiple T4s? Because with 192GB of memory and 24 cores, you're paying for a ton of resources you won't be using if you're only running inference.

agcat · 11 months ago

This is a good way to do the math. But honestly, how many products actually run at 100% utilisation? I did some math a few months ago, but mostly on the basis of active users: what the % difference would be if you have 1K to 10K users/mo. You can run this for as little as ~$0.3K/mo on serverless GPUs and ~$0.7K/mo on EC2.

The pricing is outdated now.

Here is the piece: https://www.inferless.com/learn/unraveling-gpu-inference-costs-for-llms-openai-aws-and-inferless

michaelmior · 11 months ago

There's also the option of platforms such as BentoML (I have no affiliation) that offer usage-based pricing, so you can at least take the 100% utilization assumption off the table. I'm not sure how the price compares to EKS.

https://www.bentoml.com/

baobabKoodaa · 11 months ago

If we care about cost efficiency when running LLMs, the most important things are:

1. Don't use AWS, because it's one of the most expensive cloud providers.

2. Use quantized models, because they offer the best output quality per money spent, regardless of the budget.

This article, on the other hand, focuses exclusively on running an unquantized model on AWS...

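Rough memory arithmetic behind the quantization point; the bytes-per-weight values are approximate and ignore the KV cache and runtime overhead.

```python
# Approximate weight-memory footprint of an 8B-parameter model at different precisions.
PARAMS = 8e9  # Llama 3 8B

for name, bytes_per_weight in [("fp16", 2.0), ("int8 / Q8", 1.0), ("Q4", 0.5)]:
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name:>10}: ~{gb:.0f} GB of weights")
# fp16 wants a 24GB-class card; Q8/Q4 fit comfortably on a 12GB card like a 3060.
```
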
melbourne_mat · 11 months ago

This is another one of those "I used this for 5 minutes and found this out" naive posts which add nothing useful.

Check out the host-LLMs-at-home crowd. One app to look at is llama.cpp. Model compression is one of the first techniques for successfully running models on low-capacity hardware.

barbegal · 11 months ago

There's some dodgy maths:

> (100 / 157,075,200) * 1,000,000 = $0.000000636637738

Should be $0.64, so still expensive.

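A quick check of that figure: the article's number is the cost per single token; scaled to a million tokens it comes to about $0.64, as stated.

```python
# Verifying the arithmetic called out above: $100 of monthly cost spread over
# 157,075,200 tokens, expressed per million tokens.
cost_per_token = 100 / 157_075_200
print(cost_per_token)              # ~6.37e-07 dollars per single token
print(cost_per_token * 1_000_000)  # ~0.6366 -> about $0.64 per million tokens
```
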
badgersnake · 11 months ago

I've used llama3 on my work laptop with ollama. It wrote an amazing pop song about k-nearest neighbours in the style of PJ and Duncan's 'Let's Get Ready to Rhumble', called 'Let's Get Ready to Classify'. For everything else it's next to useless.

vinni2 · 11 months ago

GGML Q8 models on Ollama can run on much cheaper hardware without losing much performance.

cheptsov · 11 months ago

With dstack you can either utilize multiple affordable cloud GPU providers at once to get the cheapest GPU offer, or use your own cluster of on-prem servers. dstack supports both together. Disclaimer: I'm a core contributor to dstack.

axegon_ · 11 months ago

Up until not too long ago I assumed that self-hosting an LLM would come at an outrageous cost. I have a bunch of problems with LLMs in general. The major one is that all LLMs (even OpenAI's) will produce output that gives you a great sense of confidence, only for you to be later slapped across the face by the harsh reality: for anything involving serious reasoning, chances are the response you got was largely bullshit. The second is that I do not entirely trust those companies with my data, be it OpenAI, Microsoft, GitHub or any other.

That said, a while ago there was this[1] thread on here which helped me snatch a brand new, unboxed P40 for peanuts. Really, the cost was 2 or 3 jars of good quality peanut butter. Sadly it's still collecting dust, since although my workstation can accommodate it, cooling is a bit of an issue. I 3D printed a bunch of hacky vents, but I haven't had the time to put it all together.

The reason I went down this road was phi-3, which blew me away by how powerful yet compact it is. Again, I would not trust it with anything big, but I have been using it for sifting through a bunch of raw, unstructured text and extracting data from it, and it has honestly done wonders. Overall, depending on your budget and your goal, running an LLM in your home lab is a very appealing idea.

[1] https://news.ycombinator.com/item?id=39477848

waldrews · 11 months ago

Hetzner GPU servers at $200/month for an RTX 4000 with 20GB seem competitive. Anyone have experience with what kind of token throughput you could get with that?

sgt101 · 11 months ago

Running 13B Code Llama on my M1 MacBook Pro as I type this...

cloudking · 11 months ago

What do you use it for? What problems does it solve?

k__ · 11 months ago

Half-OT: can I shard Llama3 and run it on multiple wasm processes?

yieldcrv · 11 months ago

This is not what I consider self hosting, but OK.

I would like to compare the costs vs hardware on-prem, so this helps with one side of the equation.

guluarte · 11 months ago

? You can run Llama 3 8B with a 3060.

jokethrowaway · 11 months ago

Yeah, or you can get a GPU server with 20GB of VRAM on Hetzner for ~200 EUR per month. RunPod and DigitalOcean are also quite competitive on price if you need a different GPU.

AWS is stupidly expensive.

ilaksh · 11 months ago

Kind of a ridiculous approach, especially for this model. Use together.ai, fireworks.ai, RunPod serverless, any serverless. Or use ollama with the default quantization; it will work on many home computers, including my gaming laptop, which is about 5 years old.