Altman has said "it's a few cents per chat", which probably means it's closer to high single-digit cents per chat. Does that estimate include amortization of upfront development costs, or is it actually the marginal cost of a chat?
All these answers are good, but I can share more concrete numbers…<p>Meta released their OPT model, which they claim is comparable to GPT-3. Guidance for running that model [1] suggests a LOT of memory - at least 350GB of <i>gpu memory</i>, which is roughly 4 A100s, and those are pricey.<p>Running this on AWS with the above suggestion would cost about $25/hr - just for one model instance. That’s almost $0.50 a minute. If it takes a few seconds of model time per request, you’ll easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably scale to hundreds of instances for heavy traffic, which may mean less efficiently utilized servers.<p>OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And it doesn’t include the upfront cost of training.<p><a href="https://alpa.ai/tutorials/opt_serving.html" rel="nofollow">https://alpa.ai/tutorials/opt_serving.html</a>
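The arithmetic above can be sketched in a few lines. Note the per-request latency here is an illustrative assumption, not a published figure:

```python
# Back-of-envelope compute cost per request, from the figures above.
INSTANCE_COST_PER_HOUR = 25.0  # ~4x A100 instance on AWS, per the comment
SECONDS_PER_REQUEST = 5.0      # assumed model latency per chat (a guess)

cost_per_second = INSTANCE_COST_PER_HOUR / 3600
compute_cost = cost_per_second * SECONDS_PER_REQUEST
print(f"compute-only cost per request: ${compute_cost:.3f}")
```

That's roughly 3.5 cents of raw compute before any infra, engineering, or headroom for burst traffic, which is how you land around $0.05 per request.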
For things like BERT where you just want to extract an embedding, the naive way you reach full utilization at inference time is that you:<p>- run tokenization of inputs on CPU<p>- sort inputs by length<p>- batch inputs of similar length and apply padding to make them uniform length<p>- pass the batches through so a single model can process many inputs in parallel.<p>For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria also may differ, but that’s another tangent.)<p>Every generated token performs attention over every previous token, both the context (or “prompt”) and the previously generated tokens (important for self-consistency). This is a quadratic operation in the vanilla case.<p>Model sizes are large, often spanning multiple machines, and the activations of later layers depend on earlier ones, meaning inference has to be pipelined.<p>The naive approach would be to have a single transaction processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, if you want to run something like Codex or ChatGPT for millions of users with low-latency inference, you’d have to have thousands of GPUs preloaded with models, and each transaction would take a highly variable amount of time.<p>If a model spans multiple machines, you’d achieve at most 1/n utilization, because each shard has to remain loaded while the others process. And if you want to do pipeline parallelism like in PipeDream, you’d have to deal with attention (KV) caches, since you don’t want to recompute every previous state each time.
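The length-bucketed batching steps listed above for encoder models can be sketched as follows. This is a minimal illustration with a fake whitespace "tokenizer" standing in for a real one:

```python
# Sketch of length-bucketed batching: tokenize, sort by length,
# then batch similar-length inputs and pad each batch to its own max.
def make_batches(texts, batch_size, pad_id=0):
    # 1. tokenize on CPU (stand-in: whitespace split -> fake token ids)
    tokenized = [[hash(w) % 1000 + 1 for w in t.split()] for t in texts]
    # 2. sort inputs by length so similar lengths land in the same batch
    tokenized.sort(key=len)
    # 3. batch, padding each batch only to that batch's longest sequence
    batches = []
    for i in range(0, len(tokenized), batch_size):
        chunk = tokenized[i:i + batch_size]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (max_len - len(seq)) for seq in chunk])
    return batches
```

Because padding is per-batch rather than global, short inputs don't get padded out to the longest input in the whole corpus, which is what keeps utilization high. None of this helps the decoder case, where generation is inherently sequential.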
> Does that estimate include amortization of upfront development costs?<p>The answer is almost certainly "no." A service like ChatGPT is expensive because it requires heavy-duty GPU computations.
The hardware required to run it (a multi-A100 server) is said to cost about $150k. If each query brings in about 3 cents, then the hardware would have to execute the model about 5 million times before it pays for itself. A bit more than that once you include the electricity bill, and even more if Microsoft charges a markup on the service, since they want to make a profit too.<p>I don't think these numbers sound very out of line. It would be easier to judge the feasibility if we knew how fast those cards can execute the model. If it takes a second per query, then a few cents seems about right; if it takes a few milliseconds, then it's a lot less than a few cents unless Microsoft charges a huge premium for the servers.
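A quick sanity check on that break-even figure, using the numbers above:

```python
# Rough break-even math: queries needed to recoup the hardware cost.
hardware_cost = 150_000    # claimed cost of the multi-GPU server
revenue_per_query = 0.03   # ~3 cents per query
queries_to_break_even = hardware_cost / revenue_per_query
print(f"{queries_to_break_even:,.0f} queries to break even")
```

5 million queries per server is plausible over a card's useful life, but it ignores electricity, hosting, staff, and the training run itself, so the real break-even point is higher.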
Because it's a language model: it doesn't really query stored information, it models relationships between words. "The words" have to be encoded and related to each other through the text.<p>There are different ways to achieve this, but in essence the model has to hold everything at once, plus the instructions on how to handle it.<p>You can actually ask it to explain how you could create a natural language processing algorithm yourself, and it will even give you a starter framework in the language of your choice. Fair warning though: for me it was a 6-hour-deep rabbit hole :D
The model is large, and every instance likely requires several GPUs (or high-grade accelerators) to run at a moderate speed - it's unclear how far they've optimized it.<p>Read the papers.
Apparently each query requires hundreds of GBs of GPU RAM on several expensive accelerator cards.<p>Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.
Basically, GPU/compute costs being so high.
Probably just the chat cost itself. Also, a whole boatload of development costs will eventually be passed on to consumers. For a cheaper alternative, try <a href="https://text-generator.io" rel="nofollow">https://text-generator.io</a>
It also analyses images, which OpenAI doesn't do
My question is: how long will it be before the average high-end computer can run it? How long before your average smartphone?<p>Memory shipped with computers has been stagnant for a decade