I love the Full Stack Deep Learning crew, and took their course several years ago in Berkeley. I highly recommend it.<p>One thing that always blows my mind is how much it's just not worth it to train LLMs in the cloud if you're a startup (and probably even less so for really large companies). Compared to 36-month reserved pricing, the break-even point was 8 months if you bought the hardware and rented some racks at a colo, and that includes the hands-on support. Having the dedicated hardware also meant that researchers were willing to experiment more when we weren't doing a planned training job, since it wouldn't pull from our budget. We spent a sizable chunk of our raise on that cluster, but it was worth every penny.<p>I will say that I would not put customer-facing inference on prem at this point: the resiliency of the cloud normally offsets the pricing, and most inference can be done with cheaper hardware than training. For training, though, you can get away with a weaker SLA, and the cloud is always there if you really need to burst beyond what you've purchased.
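For a rough sense of that math, here is a minimal break-even sketch. All of the numbers are illustrative assumptions (cluster cost, colo fees, equivalent reserved cloud spend), not our actual figures:

```python
# Break-even sketch: buy hardware + colo vs. 36-month reserved cloud pricing.
# Every number here is an illustrative assumption, not a real quote.

cluster_cost = 400_000    # up-front hardware + install, USD (assumed)
colo_monthly = 5_000      # rack space, power, remote hands per month (assumed)
cloud_monthly = 55_000    # equivalent reserved cloud spend per month (assumed)

def break_even_months(cluster_cost: float, colo_monthly: float, cloud_monthly: float) -> int:
    """Months until cumulative cloud spend catches up with buy-and-colo spend."""
    months, on_prem, cloud = 0, cluster_cost, 0.0
    while cloud < on_prem:
        months += 1
        on_prem += colo_monthly
        cloud += cloud_monthly
    return months

print(break_even_months(cluster_cost, colo_monthly, cloud_monthly))  # -> 8 with these inputs
```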
My experience with many of these services renting mostly A100s:<p>LambdaLabs: For on-demand instances, they are the cheapest available option. Their offering is straightforward, and I've never had a problem. The downside is that their instance availability is spotty. It seems like things have gotten a little better in the last month, and 8x machines are available more often than not, but single A100s were rarely available for most of this year. Another downside is the lack of persistent storage, meaning you have to transfer your data every time you start a new instance. They have some persistent storage in beta, but it's effectively useless since it's only in one region and there are no instances in that region that I've seen.<p>Jarvis: Didn't work for me when I tried them a couple of months ago. The instances would never finish booting. It's also a pre-paid system, so you have to fill up your "balance" before renting machines. But their customer service was friendly and gave me a full refund, so <i>shrug</i>.<p>GCP: This is my go-to so far. A100s are $1.1/hr interruptible, and of course you get all the other Google offerings like persistent disks, object storage, managed SQL, container registry, etc. Availability of interruptible instances has been consistently quite good, if a bit confusing. I've had some machines up for a week solid without interruption, while other times I can tear down a stack of machines and immediately request a new one only to be told they are out of availability. The downsides are the usual GCP downsides: poor documentation, sometimes weird glitches, and perhaps the worst billing system I've seen outside of the healthcare industry.<p>Vast.ai: They can be a good chunk cheaper, but at the cost of privacy, security, support, and reliability. Pre-loaded balance only. For certain workloads, and if you're highly cost sensitive, this is a good option to consider.<p>RunPod: Terrible performance issues. Pre-loaded balance only. Non-responsive customer support. I ended up having to get my credit card company involved.<p>Self-hosted: As a sibling comment points out, self-hosting is a great option to consider. In particular, "Having the dedicated hardware also meant that researchers were willing to experiment more". I've got a couple of cards in my lab that I use for experimentation, and then throw to the cloud for big runs.
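The thing that makes interruptible/spot instances workable for multi-day runs is aggressive checkpointing to storage that outlives the instance. A minimal sketch (the checkpoint path, model, and batch are placeholders, and the persistent-disk mount point is an assumption):

```python
import os
import torch

# Checkpoint/resume loop for training on interruptible (spot/preemptible) instances.
# CKPT lives on a persistent disk so a preempted run can resume on a fresh machine.
CKPT = "/mnt/persistent/ckpt.pt"   # assumed mount point, adjust for your setup

model = torch.nn.Linear(512, 512).cuda()          # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

if os.path.exists(CKPT):                          # resume after a preemption
    state = torch.load(CKPT, map_location="cuda")
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 100_000):
    x = torch.randn(32, 512, device="cuda")       # stand-in for a real batch
    loss = model(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:                           # checkpoint often enough to bound lost work
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT)
```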
May I also suggest the suite of open-source DeepView tools, which are available on PyPI. They profile and predict your specific model's training performance on a variety of GPUs. I wrote a LinkedIn post with a usage GIF here: <a href="https://www.linkedin.com/posts/activity-7057419660312371200-hdb5?utm_source=share&utm_medium=member_desktop" rel="nofollow">https://www.linkedin.com/posts/activity-7057419660312371200-...</a><p>And links to PyPI:<p>=> <a href="https://pypi.org/project/deepview-profile/" rel="nofollow">https://pypi.org/project/deepview-profile/</a>
=> <a href="https://pypi.org/project/deepview-predict/" rel="nofollow">https://pypi.org/project/deepview-predict/</a><p>And you can actually do it in the browser for several foundation models (more to come):
=> <a href="https://centml.ai/calculator/" rel="nofollow">https://centml.ai/calculator/</a><p>Note: I have personal interests in this startup.
"Those god damn AWS charges" -Silicon Valley.
Might as well build your own GPU farm. Some of these cards you can probably get used for around $6K (guesstimating).
Doesn't anybody use TPUs from Google?<p>Given the heterogeneous nature of GPUs, RAM, tensor cores, etc., it would be nice to have a direct comparison of, say, teraflop-hours per dollar, or something like that.
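Even a back-of-envelope version of that metric is easy to compute. In the sketch below, the TFLOPS figures are approximate vendor peak FP16/BF16 tensor-core numbers and the prices are assumed hourly rates for illustration (the A100 figure is the GCP interruptible price mentioned elsewhere in the thread), not quotes from any provider:

```python
# Back-of-envelope "peak TFLOPS-hours per dollar" comparison.
# TFLOPS: approximate vendor peak FP16/BF16 tensor-core numbers.
# Prices: assumed $/hr for illustration only, not quotes.
gpus = {
    #             (peak TFLOPS, assumed $/hr)
    "V100 16GB": (125,  0.60),
    "A100 80GB": (312,  1.10),
    "H100 80GB": (989,  2.50),
}

for name, (tflops, price) in gpus.items():
    print(f"{name}: {tflops / price:,.0f} peak TFLOPS-hours per dollar")
```

It ignores memory bandwidth, interconnect, and real-world utilization, which is why measured throughput/$ benchmarks are still more useful.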
Lambda has a very interesting benchmark page
<a href="https://lambdalabs.com/gpu-benchmarks" rel="nofollow">https://lambdalabs.com/gpu-benchmarks</a><p>If you look through the throughput/$ metric, the V100 16GB looks like a great deal, followed by H100 80GB PCIe 5. For most benchmarks, the A100 looks worse in comparison
FWIW, Paperspace has a similar GPU comparison guide located here <a href="https://www.paperspace.com/gpu-cloud-comparison" rel="nofollow">https://www.paperspace.com/gpu-cloud-comparison</a><p>Disclosure: I work on Paperspace
Why does no one rent out AMD GPUs?<p>I know those cards are second-class citizens in the world of deep learning, but they have had (experimental) PyTorch support via ROCm for a while now, so where are the offerings?
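For what it's worth, the ROCm builds of PyTorch expose AMD GPUs through the usual torch.cuda API (HIP underneath), so checking whether a rented AMD box would just work is simple. A sketch, assuming a ROCm wheel is installed:

```python
import torch

# On ROCm builds of PyTorch, AMD GPUs show up through the torch.cuda API,
# so most CUDA-targeted training code runs unchanged.
print(torch.cuda.is_available())      # True on a working ROCm install
print(torch.version.hip)              # ROCm/HIP version string; None on CUDA builds
print(torch.cuda.get_device_name(0))  # e.g. an MI-series or Radeon device name
```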
A comprehensive list of GPU options and pricing from cloud vendors. Very useful if you're looking to train or deploy large machine learning/deep learning models.
Looking to run a cloud instance of Stable Diffusion for personal experimentation. Looking at cloud mostly because I don't have a GPU or desktop hardware at home, and my Mac M1 is too slow. But I'd also have to contend with constantly switching the instance on and off several times a week to use it.<p>Wondering which vendors other HN'ers are using to achieve this?
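In case it helps once the instance is up: the standard diffusers pipeline is only a few lines. A sketch (the model ID and fp16 settings are common choices, not requirements):

```python
import torch
from diffusers import StableDiffusionPipeline

# Minimal Stable Diffusion inference on a rented GPU instance.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed model choice
    torch_dtype=torch.float16,          # halves memory, fine for inference
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("out.png")
```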
I built a slightly less detailed version of this, which also lists free credits: <a href="https://cloud-gpus.com/" rel="nofollow">https://cloud-gpus.com/</a><p>Open to any feedback/suggestions! Will be adding 4090/H100 shortly.
Out of curiosity: do you mostly use/want one GPU, or the full server with all GPUs (8x A100 80GB, or 16x A100 40GB, though I think only Google Cloud has those)? Or a mix?
Missing Oracle Cloud which has a massive GPU footprint - <a href="https://www.oracle.com/cloud/compute/gpu/" rel="nofollow">https://www.oracle.com/cloud/compute/gpu/</a>