I love the Full Stack Deep Learning crew, and took their course several years ago in Berkeley. I highly recommend it.<p>One thing that always blows my mind is how much it's just not worth it to train LLMs in the cloud if you're a startup (and probably even less so for really large companies). Compared to 36-month reserved pricing, the break-even point was 8 months if you bought the hardware and rented some racks at a colo, and that includes the hands-on support. Having the dedicated hardware also meant that researchers were willing to experiment more when we weren't doing a planned training job, since it wouldn't pull from our budget. We spent a sizable chunk of our raise on that cluster, but it was worth every penny.<p>I will say that I would not put customer-facing inference on prem at this point: the resiliency of the cloud normally offsets the pricing, and most inference can be done with cheaper hardware than training. For training, though, you can get away with a weaker SLA, and the cloud is always there if you really need to burst beyond what you've purchased.
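For a rough sense of that math, here is a minimal break-even sketch. All of the numbers are illustrative assumptions (cluster cost, colo fees, equivalent reserved cloud spend), not our actual figures:

```python
# Break-even sketch: buy hardware + colo vs. 36-month reserved cloud pricing.
# Every number here is an illustrative assumption, not a real quote.

cluster_cost = 400_000    # up-front hardware + install, USD (assumed)
colo_monthly = 5_000      # rack space, power, remote hands per month (assumed)
cloud_monthly = 55_000    # equivalent reserved cloud spend per month (assumed)

def break_even_months(cluster_cost: float, colo_monthly: float, cloud_monthly: float) -> int:
    """Months until cumulative cloud spend catches up with buy-and-colo spend."""
    months, on_prem, cloud = 0, cluster_cost, 0.0
    while cloud < on_prem:
        months += 1
        on_prem += colo_monthly
        cloud += cloud_monthly
    return months

print(break_even_months(cluster_cost, colo_monthly, cloud_monthly))  # -> 8 with these inputs
```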
My experience with many of these services renting mostly A100s:<p>LambdaLabs: For on-demand instances, they are the cheapest available option. Their offering is straightforward, and I've never had a problem. The downside is that their instance availability is spotty. It seems like things have gotten a little better in the last month, and 8x machines are available more often than not, but single A100s were rarely available for most of this year. Another downside is the lack of persistent storage, meaning you have to transfer your data every time you start a new instance. They have some persistent storage in beta, but it's effectively useless since it's only in one region and there are no instances in that region that I've seen.<p>Jarvis: Didn't work for me when I tried them a couple of months ago. The instances would never finish booting. It's also a pre-paid system, so you have to fill up your "balance" before renting machines. But their customer service was friendly and gave me a full refund, so <i>shrug</i>.<p>GCP: This is my go-to so far. A100s are $1.1/hr interruptible, and of course you get all the other Google offerings like persistent disks, object storage, managed SQL, container registry, etc. Availability of interruptible instances has been consistently quite good, if a bit confusing. I've had some machines up for a week solid without interruption, while other times I can tear down a stack of machines and immediately request a new one only to be told they are out of availability. The downsides are the usual GCP downsides: poor documentation, sometimes weird glitches, and perhaps the worst billing system I've seen outside of the healthcare industry.<p>Vast.ai: They can be a good chunk cheaper, but at the cost of privacy, security, support, and reliability. Pre-loaded balance only. For certain workloads, and if you're highly cost sensitive, this is a good option to consider.<p>RunPod: Terrible performance issues. Pre-loaded balance only. Non-responsive customer support. I ended up having to get my credit card company involved.<p>Self-hosted: As a sibling comment points out, self-hosting is a great option to consider. In particular, "Having the dedicated hardware also meant that researchers were willing to experiment more". I've got a couple of cards in my lab that I use for experimentation, and then throw to the cloud for big runs.
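The thing that makes interruptible/spot instances workable for multi-day runs is aggressive checkpointing to storage that outlives the instance. A minimal sketch (the checkpoint path, model, and batch are placeholders, and the persistent-disk mount point is an assumption):

```python
import os
import torch

# Checkpoint/resume loop for training on interruptible (spot/preemptible) instances.
# CKPT lives on a persistent disk so a preempted run can resume on a fresh machine.
CKPT = "/mnt/persistent/ckpt.pt"   # assumed mount point, adjust for your setup

model = torch.nn.Linear(512, 512).cuda()          # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

if os.path.exists(CKPT):                          # resume after a preemption
    state = torch.load(CKPT, map_location="cuda")
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 100_000):
    x = torch.randn(32, 512, device="cuda")       # stand-in for a real batch
    loss = model(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:                           # checkpoint often enough to bound lost work
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT)
```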
May I also suggest the suite of open-source DeepView tools, which are available on PyPI. They profile and predict your specific model's training performance on a variety of GPUs. I wrote a LinkedIn post with a usage GIF here: <a href="https://www.linkedin.com/posts/activity-7057419660312371200-hdb5?utm_source=share&utm_medium=member_desktop" rel="nofollow">https://www.linkedin.com/posts/activity-7057419660312371200-...</a><p>And links to PyPI:<p>=> <a href="https://pypi.org/project/deepview-profile/" rel="nofollow">https://pypi.org/project/deepview-profile/</a>
=> <a href="https://pypi.org/project/deepview-predict/" rel="nofollow">https://pypi.org/project/deepview-predict/</a><p>And you can actually do it in the browser for several foundation models (more to come):
=> <a href="https://centml.ai/calculator/" rel="nofollow">https://centml.ai/calculator/</a><p>Note: I have personal interests in this startup.
"Those god damn AWS charges" -Silicon Valley.
Might as well build your own GPU farm. Some of these cards you can probably get used for around $6K (guesstimating).
Doesn't anybody use TPUs from Google?<p>Given the heterogeneous nature of GPUs, RAM, tensor cores, etc., it would be nice to have a direct comparison of, say, teraflop-hours per dollar, or something like that.
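Even a back-of-envelope version of that metric is easy to compute. In the sketch below, the TFLOPS figures are approximate vendor peak FP16/BF16 tensor-core numbers and the prices are assumed hourly rates for illustration (the A100 figure is the GCP interruptible price mentioned elsewhere in the thread), not quotes from any provider:

```python
# Back-of-envelope "peak TFLOPS-hours per dollar" comparison.
# TFLOPS: approximate vendor peak FP16/BF16 tensor-core numbers.
# Prices: assumed $/hr for illustration only, not quotes.
gpus = {
    #             (peak TFLOPS, assumed $/hr)
    "V100 16GB": (125,  0.60),
    "A100 80GB": (312,  1.10),
    "H100 80GB": (989,  2.50),
}

for name, (tflops, price) in gpus.items():
    print(f"{name}: {tflops / price:,.0f} peak TFLOPS-hours per dollar")
```

It ignores memory bandwidth, interconnect, and real-world utilization, which is why measured throughput/$ benchmarks are still more useful.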
Lambda has a very interesting benchmark page
<a href="https://lambdalabs.com/gpu-benchmarks" rel="nofollow">https://lambdalabs.com/gpu-benchmarks</a><p>If you look through the throughput/$ metric, the V100 16GB looks like a great deal, followed by H100 80GB PCIe 5. For most benchmarks, the A100 looks worse in comparison
FWIW, Paperspace has a similar GPU comparison guide located here <a href="https://www.paperspace.com/gpu-cloud-comparison" rel="nofollow">https://www.paperspace.com/gpu-cloud-comparison</a><p>Disclosure: I work on Paperspace
Why does no one rent out AMD GPUs?<p>I know those cards are second-class citizens in the world of deep learning, but they have had (experimental) PyTorch support via ROCm for a while now, so where are the offerings?
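For what it's worth, the ROCm builds of PyTorch expose AMD GPUs through the usual torch.cuda API (HIP underneath), so checking whether a rented AMD box would just work is simple. A sketch, assuming a ROCm wheel is installed:

```python
import torch

# On ROCm builds of PyTorch, AMD GPUs show up through the torch.cuda API,
# so most CUDA-targeted training code runs unchanged.
print(torch.cuda.is_available())      # True on a working ROCm install
print(torch.version.hip)              # ROCm/HIP version string; None on CUDA builds
print(torch.cuda.get_device_name(0))  # e.g. an MI-series or Radeon device name
```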
A comprehensive list of GPU options and pricing from cloud vendors. Very useful if you're looking to train or deploy large machine learning/deep learning models.
Looking to run a cloud instance of Stable Diffusion for personal experimentation. Looking at cloud mostly because I don't have a GPU or desktop hardware at home, and my Mac M1 is too slow. But I'd also have to contend with constantly switching the instance on and off several times a week to use it.<p>Wondering which vendors other HN'ers are using to achieve this?
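In case it helps once the instance is up: the standard diffusers pipeline is only a few lines. A sketch (the model ID and fp16 settings are common choices, not requirements):

```python
import torch
from diffusers import StableDiffusionPipeline

# Minimal Stable Diffusion inference on a rented GPU instance.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed model choice
    torch_dtype=torch.float16,          # halves memory, fine for inference
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("out.png")
```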
I built a slightly less detailed version of this, which also lists free credits: <a href="https://cloud-gpus.com/" rel="nofollow">https://cloud-gpus.com/</a><p>Open to any feedback/suggestions! Will be adding 4090/H100 shortly.
Out of curiosity: do you mostly use/want one GPU, or the full server with all GPUs (8x A100 80GB, or 16x A100 40GB, though I think only Google Cloud has those)? Or a mix?
Missing Oracle Cloud which has a massive GPU footprint - <a href="https://www.oracle.com/cloud/compute/gpu/" rel="nofollow">https://www.oracle.com/cloud/compute/gpu/</a>