Altman has said "it's a few cents per chat", which probably means it's closer to high single-digit cents per chat. Does that estimate include amortization of upfront development costs, or is it actually the marginal cost of a chat?
All these answers are good, but I can share more concrete numbers…<p>Meta released their OPT model, which they claim is comparable to GPT-3. Guidance for running that model [1] suggests a LOT of memory - at least 350GB of <i>gpu memory</i>, which is roughly 4 A100s, and those are pricey.<p>Running this on AWS with the above suggestion would cost about $25/hr - just for one model instance. That’s almost $0.50 a minute. If it takes a few seconds of model time per request, you’ll easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably scale to hundreds of instances for heavy traffic, which may mean less efficiently utilized servers.<p>OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And it doesn’t include the upfront cost of training.<p><a href="https://alpa.ai/tutorials/opt_serving.html" rel="nofollow">https://alpa.ai/tutorials/opt_serving.html</a>
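The arithmetic above can be sketched in a few lines. Note the per-request latency here is an illustrative assumption, not a published figure:

```python
# Back-of-envelope compute cost per request, from the figures above.
INSTANCE_COST_PER_HOUR = 25.0  # ~4x A100 instance on AWS, per the comment
SECONDS_PER_REQUEST = 5.0      # assumed model latency per chat (a guess)

cost_per_second = INSTANCE_COST_PER_HOUR / 3600
compute_cost = cost_per_second * SECONDS_PER_REQUEST
print(f"compute-only cost per request: ${compute_cost:.3f}")
```

That's roughly 3.5 cents of raw compute before any infra, engineering, or headroom for burst traffic, which is how you land around $0.05 per request.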
For things like BERT where you just want to extract an embedding, the naive way you reach full utilization at inference time is that you:<p>- run tokenization of inputs on CPU<p>- sort inputs by length<p>- batch inputs of similar length and apply padding to make them uniform length<p>- pass the batches through so a single model can process many inputs in parallel.<p>For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria also may differ, but that’s another tangent.)<p>Every generated token performs attention over every previous token, both the context (or “prompt”) and the previously generated tokens (important for self-consistency). This is a quadratic operation in the vanilla case.<p>Model sizes are large, often spanning multiple machines, and the activations of later layers depend on earlier ones, meaning inference has to be pipelined.<p>The naive approach would be to have a single transaction processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, if you want to run something like Codex or ChatGPT for millions of users with low-latency inference, you’d have to have thousands of GPUs preloaded with models, and each transaction would take a highly variable amount of time.<p>If a model spans multiple machines, you’d achieve at most 1/n utilization, because each shard has to remain loaded while the others process. And if you want to do pipeline parallelism like in PipeDream, you’d have to deal with attention (KV) caches, since you don’t want to recompute every previous state each time.
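The length-bucketed batching steps listed above for encoder models can be sketched as follows. This is a minimal illustration with a fake whitespace "tokenizer" standing in for a real one:

```python
# Sketch of length-bucketed batching: tokenize, sort by length,
# then batch similar-length inputs and pad each batch to its own max.
def make_batches(texts, batch_size, pad_id=0):
    # 1. tokenize on CPU (stand-in: whitespace split -> fake token ids)
    tokenized = [[hash(w) % 1000 + 1 for w in t.split()] for t in texts]
    # 2. sort inputs by length so similar lengths land in the same batch
    tokenized.sort(key=len)
    # 3. batch, padding each batch only to that batch's longest sequence
    batches = []
    for i in range(0, len(tokenized), batch_size):
        chunk = tokenized[i:i + batch_size]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (max_len - len(seq)) for seq in chunk])
    return batches
```

Because padding is per-batch rather than global, short inputs don't get padded out to the longest input in the whole corpus, which is what keeps utilization high. None of this helps the decoder case, where generation is inherently sequential.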
> Does that estimate include amortization of upfront development costs?<p>The answer is almost certainly "no." A service like ChatGPT is expensive because it requires heavy-duty GPU computations.
The hardware required to run it (a multi-A100 server) is said to cost about $150k. If each query brings in about 3 cents, then the hardware would have to execute the model about 5 million times before it pays for itself. A bit more than that once you include the electricity bill, and even more if Microsoft charges a markup on the service, since they want to make a profit too.<p>I don't think these numbers sound very out of line. It would be easier to judge the feasibility if we knew how fast those cards can execute the model. If it takes a second per query, then a few cents seems about right; if it takes a few milliseconds, then it's a lot less than a few cents unless Microsoft charges a huge premium for the servers.
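A quick sanity check on that break-even figure, using the numbers above:

```python
# Rough break-even math: queries needed to recoup the hardware cost.
hardware_cost = 150_000    # claimed cost of the multi-GPU server
revenue_per_query = 0.03   # ~3 cents per query
queries_to_break_even = hardware_cost / revenue_per_query
print(f"{queries_to_break_even:,.0f} queries to break even")
```

5 million queries per server is plausible over a card's useful life, but it ignores electricity, hosting, staff, and the training run itself, so the real break-even point is higher.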
Because it's a language model: it doesn't really query stored information, it models relationships between words. "The words" have to be encoded and related to each other through the text.<p>There are different ways to achieve this, but in essence the model has to hold everything at once, plus the instructions on how to handle it.<p>You can actually ask it to explain how you could create a natural language processing algorithm yourself, and it will even give you a starter framework in the language of your choice. Fair warning though: for me it was a 6-hour-deep rabbit hole :D
The model is large, and every instance likely requires several GPUs (or high-grade accelerators) to run at a moderate speed - it's unclear how far they've optimized it.<p>Read the papers.
Apparently each query requires hundreds of GBs of GPU RAM on several expensive accelerator cards.<p>Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.
Basically, GPU/compute costs being so high.
Probably just the chat cost itself. Also, a whole boatload of development costs will eventually be passed on to consumers. For a cheaper alternative, try <a href="https://text-generator.io" rel="nofollow">https://text-generator.io</a>
It also analyses images, which OpenAI doesn't do
My question is: how long will it be before the average high-end computer can run it? How long before your average smartphone?<p>Memory shipped with computers has been stagnant for a decade