Hey folks! We're Alex and Evan, and we're working on putting together a 512 H100 compute cluster for startups and researchers to train large generative models on.
- it runs at the lowest possible margins (<$2.00/hr per H100)
- designed for bursty training runs, so you can take, say, 128 H100s for a week
- you don’t need to commit to multiple years of compute or pay for a year upfront

Big labs like OpenAI and DeepMind have big clusters that support this kind of bursty allocation for their researchers, but startups so far have had to get very small clusters on very long-term contracts, wait through months of lead time, and try to keep them busy all the time.

Our goal is to make it about 10-20x cheaper to do an AI startup than it is right now. Stable Diffusion only costs about $100k to train -- in theory every YC company could get up to that scale. It's just that no cloud provider in the world will give you $100k of compute for just a couple of weeks, so startups have to raise 20x that much to buy a whole year of compute.

Once the cluster is online, we'll be pretty much the only option for startups that want to do big training runs like that.
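To make the arithmetic concrete, here's a rough back-of-envelope in Python using the numbers above (a sketch, not a price quote: the $2.00/hr figure is our target ceiling, and the run sizes are just examples):

    # Rough cost of a bursty training run, using the numbers from the post
    # (<$2.00/hr per H100, e.g. 128 GPUs for a week).
    GPU_HOURLY_RATE = 2.00  # USD per H100-hour, target upper bound

    def burst_cost(num_gpus: int, days: float, rate: float = GPU_HOURLY_RATE) -> float:
        """Total cost of reserving num_gpus H100s for a given number of days."""
        return num_gpus * days * 24 * rate

    print(burst_cost(128, 7))                      # ~$43k for 128 H100s over one week
    print(100_000 / (128 * 24 * GPU_HOURLY_RATE))  # a ~$100k run is ~16 days at 128 GPUs

In other words, a Stable-Diffusion-scale run fits in a few weeks of bursty usage instead of a year-long reservation.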
I hope you succeed. TPU Research Cloud (TRC) tried this in 2019. It was how I got my start.

In 2023 you can barely get a single TPU for more than an hour. Back then you could get literally hundreds, with an s.

I believed in TRC. I thought they’d solve it by scaling, and building a whole continent of TPUs. But in the end, TPU time was cut short in favor of internal researchers -- some researchers being more equal than others. And how could it be any other way? If I made a proposal today to get these H100s to train GPT to play chess, people would laugh. The world is different now.

Your project has a youthful optimism that I hope you won’t lose as you go. And in fact it might be the way to win in the long run. So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them. It’s the only reason I was able to become anybody.
> <i>Rather than each of K startups individually buying clusters of N gpus, together we buy a cluster with NK gpus... Then we set up a job scheduler to allocate compute</i><p>In theory, this sounds almost identical to the business model behind AWS, Azure, and other cloud providers. "Instead of everyone buying a fixed amount of hardware for individual use, we'll buy a massive pool of hardware that people can time-share." Outside of cloud providers having to mark up prices to give themselves a net-margin, is there something else they are failing to do, hence creating the need for these projects?
Having hosted infrastructure in CA at multiple colos, I would advise you to host it elsewhere if you can; the cost of power and other infrastructure is much higher in CA than in AZ or NV.
> It's just that no cloud provider in the world will give you $100k of compute for just a couple weeks<p>I've never had to buy very large compute, but I thought that was the whole point of the cloud
I am super interested in AI on a personal level and have been involved for a number of years.

I have never seen a GPU crunch quite like the one right now. To anyone who is interested in hobbyist ML, I highly, highly recommend using vast.ai.
I know AWS/GCP/Azure have overhead and I understand why so many companies choose to go bare metal on their ops. I personally rarely think it's worth the time and effort, but I get that at scale the savings can be substantial.

But for AI training? If the public cloud isn't competitive even for bursty AI training, their margins are much higher than I anticipated.

OP mentions a 10-20x cost reduction? Compared to what? AWS?
Hi, SF lover [1] here. Anything interesting to note about your name? Will your hardware actually be based in SF? Any plans to start meetups or bring customers together for socializing or anything like that?

[1] We have not gone the way of the Xerces blue [2] yet... we still exist!

[2] https://en.wikipedia.org/wiki/Xerces_blue
Noob thought: so this would be a blueprint for how a mid-tier university with an older large compute cluster operation could do things in 2023 to support large LLM research?

Perhaps it's also a way for freshly applying grad students to gauge whether a university is positioned to do LLM research at the scale it requires...
Nat Friedman and Daniel Gross set up a 2,512-H100 cluster [1] for their startups, with a very similar "shared" model. Might be interesting to connect with them.

[1] https://andromedacluster.com/
What kind of hardware setup are you planning out? Colocation, roll-your-own data center, something in between? Any thoughts on what servers the GPUs will be housed in?
Honest question I don’t know how to consider: are we further along or behind with AI given crypto’s use of GPUs? Have the same cards bought for mining furthered AI, or did that demand lead to more research into GPUs and what they can do? Or would we be further along if we weren’t wasting these cards on mining?
The billion-dollar question is:

Who is funding this?

Because if it’s VC then it’s going to have the same fate as everything else after 5-7 years.

I hope y’all have an equally innovative business model. You’ll need it if you want to keep doing what you’re doing now for more than a few years.
Please take this question without prejudice.

Is it accurate to say you’re willing to go into ~20,000,000 USD of debt to sell discounted compute-as-a-service to researchers/startups, but unwilling to go into debt to sponsor the undergraduate degrees of ~100-500 students at top-tier schools? (40k-200k USD per degree)

Or, you know, build and fund a small public school/library or two for ~5 years?