In the span of a few months, with a small team of researchers and engineers, we trained a 70B-parameter model from scratch on our own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks. Using our cluster for high-performance training meant that every component (InfiniBand, Ethernet, GPUs, and the nodes themselves) had to work perfectly. If even a single one of the over 12,000 connections was a little flaky, it could slow down the entire training run.

We're sharing open-source scripts and an end-to-end infrastructure setup guide that detail the process of making everything work perfectly, and ensuring that it stays that way.

This is one part of a three-part toolkit on training a 70B model from scratch. The other two parts focus on evaluations and on CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/

Thoughts and questions welcome! :)
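To give a flavor of what "checking every connection" means in practice, here is a minimal sketch (an illustration only, not the actual scripts from the repo) that flags InfiniBand ports on one host that are down or negotiated below the expected link rate; the 400 Gb/s expected rate is an assumption you would tune to your own fabric.

```python
# Illustrative sketch only, not the scripts from the post: flag InfiniBand
# ports on one host that are down or negotiated below the expected link rate.
import re
import subprocess

EXPECTED_RATE_GBPS = 400  # assumption; set to whatever your fabric should run at

def check_ib_ports() -> list:
    """Parse `ibstat` output and return human-readable problem descriptions."""
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    problems = []
    # ibstat prints one block per port; split on the "Port N:" headers.
    for block in re.split(r"Port \d+:", out)[1:]:
        state = re.search(r"State:\s*(\w+)", block)
        phys = re.search(r"Physical state:\s*(\w+)", block)
        rate = re.search(r"Rate:\s*(\d+)", block)
        if state and state.group(1) != "Active":
            problems.append(f"port not Active (state={state.group(1)})")
        if phys and phys.group(1) != "LinkUp":
            problems.append(f"link not up (physical state={phys.group(1)})")
        if rate and int(rate.group(1)) < EXPECTED_RATE_GBPS:
            problems.append(f"degraded link rate: {rate.group(1)} Gb/s")
    return problems

if __name__ == "__main__":
    for problem in check_ib_ports():
        print("WARN:", problem)
```

A link can also be "up" but slow, so checks like this are only one layer; in practice you would pair them with end-to-end all-reduce benchmarks across nodes.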
> This post focuses on one cluster that had 4,092 H100 GPUs spread across 511 computers, with eight GPUs to a computer

Am I right in understanding that that's over $100 million worth of GPUs?

I wonder when, if ever, any of this will be within the reach of an enthusiast on a gaming-PC budget.
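Back-of-envelope (the per-GPU price is an assumption; H100 SXM boards have been quoted at roughly $25k to $40k each):

```python
# Back-of-envelope only; per-GPU prices are assumptions, not quoted figures.
num_gpus = 4_092
price_low_usd, price_high_usd = 25_000, 40_000  # assumed price range per H100
low = num_gpus * price_low_usd / 1e6
high = num_gpus * price_high_usd / 1e6
print(f"~${low:.0f}M to ~${high:.0f}M for the GPUs alone")  # ~$102M to ~$164M
```

And that is before networking, storage, host machines, and the facility itself.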
This is hella cool. Cisco has a new NVIDIA collab with 800G per port. I don't recall if it was RoCE or not. The InfiniBand is accessible by the GPUs here? Beautiful.

Thank you for sharing all this. One of the more directly useful posts.
This was discussed on the Latent Space podcast a few days ago: https://www.latent.space/p/llm-training-2024

That was a good episode, worth a listen to hear the justifications behind some of these decisions.
I am fascinated by the total electrical power drawn to build these models (power and cooling, I guess). Do you have any numbers on that? For context, Zuckerberg suggested on a podcast that the next 1 GW model was being planned; basically a data centre with a mid-sized power plant attached.
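For a rough lower bound, just from GPU board power (the ~700 W TDP and the PUE figure below are assumptions):

```python
# Rough estimate only; TDP and PUE are assumptions, and host CPUs, networking,
# and storage are not counted separately here.
num_gpus = 4_092
gpu_tdp_w = 700     # assumed H100 SXM board power
pue = 1.3           # assumed facility overhead (cooling, power conversion)
gpu_mw = num_gpus * gpu_tdp_w / 1e6
print(f"GPUs alone: ~{gpu_mw:.1f} MW; facility: plausibly ~{gpu_mw * pue:.1f}+ MW")
```

So a cluster of this size is in the single-digit-megawatt range, still two to three orders of magnitude below the 1 GW figure.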
This is such a valuable piece.
I've learned so much reading it! And your open-source code is great as well.

Some open questions I have:
1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines/switches?
2) Which considerations in choosing the cluster architecture have proven the most valuable (apart from the all2all comms)?
3) Can you share a bit more about your logging infra, beyond the fact that it was Loki-based?
4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime?

Thanks!
Honest question: why is there so much PC hardware in the mix here? Why don't we have PCI + InfiniBand backends with GPUs and a tiny orchestrating ARM controller, and just let them all coordinate with each other? Is it just "momentum" from previous designs and/or a lack of "market" for specialized GPU controllers?
4,092 H100 GPUs.

They're working on "self-coding".
No-code or minimal-code solutions, or something else?

There are also quite a few articles and other resources people may be interested in on their website:
<a href="https://imbue.com/our-work/" rel="nofollow">https://imbue.com/our-work/</a>
I wonder if it's possible for a huge number of hobbyists to team up and train a model together in a distributed manner, like SETI@home or Folding@home. Or does this kind of workload not really lend itself to that approach?

Those projects were of course characterised by the ability to split the work into pretty self-contained work packages. Not sure if that can be done with model training.
It would be quite interesting to see the same hardware used to repeat the training, but with raw Unicode instead of tokenized training data.

I'd like to see the difference in performance on spelling and rhymes.
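To make the spelling point concrete, a tiny sketch; the "BPE-style" split below is made up for illustration and is not the output of any particular tokenizer:

```python
# Illustration only: the token split below is invented to show the idea,
# not produced by a real tokenizer.
word = "strawberry"

byte_level = list(word.encode("utf-8"))
print(byte_level)   # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121] - one unit per letter

bpe_like = ["straw", "berry"]
print(bpe_like)     # the model never sees the individual letters inside each chunk
```

The trade-off is sequence length: byte-level input makes the same text several times longer, which is a big part of why tokenizers are used in the first place.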