Infrastructure setup and open-source scripts to train 70B model from bare metal

325 points | by thejash | 11 months ago | 12 comments

thejash | 11 months ago
In the span of a few months, with a small team of researchers and engineers, we trained a 70B-parameter model from scratch on our own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks. Using our cluster for high-performance training meant that every component — InfiniBand, Ethernet, GPUs, and the nodes themselves — had to work perfectly. If even a single one of the over 12,000 connections was a little flaky, it could slow down the entire training run.

We're sharing open-source scripts and an end-to-end guide for infrastructure set-up that details the process of making everything work perfectly, and ensuring that it stays that way.

This is one part of a three-part toolkit on training a 70B model from scratch. The other two sections focus on evaluations and CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/

Thoughts and questions welcome! :)
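As an illustration of what "ensuring that it stays that way" involves, here is a minimal sketch of a link-health check in the same spirit: it walks the standard Linux RDMA sysfs counters that should stay at zero on a healthy fabric. The counter names are real sysfs entries, but the script itself is illustrative, not Imbue's actual tooling:

```python
#!/usr/bin/env python3
"""Flag InfiniBand ports whose error counters are nonzero.

Illustrative sketch; run on every node (e.g. via pdsh/ansible) and
investigate anything it prints."""
from pathlib import Path

# Standard RDMA sysfs counters that should stay at zero on a healthy link.
SUSPECT = [
    "symbol_error",
    "link_downed",
    "port_rcv_errors",
    "local_link_integrity_errors",
]

for port in Path("/sys/class/infiniband").glob("*/ports/*"):
    for name in SUSPECT:
        counter = port / "counters" / name
        if not counter.exists():
            continue
        value = int(counter.read_text())
        if value > 0:
            print(f"{port}: {name}={value}  <- possibly flaky link")
```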
alias_neo | 11 months ago
> This post focuses on one cluster that had 4,092 H100 GPUs spread across 511 computers, with eight GPUs to a computer

Am I right in understanding that's over $100 million worth of GPUs?

I wonder what/when/if any of this will be within the realms of an enthusiast with a gaming-PC budget.
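A quick back-of-the-envelope supports that reading; the per-unit price below is an assumption (H100s went for roughly $25-40k at the time), not a figure from the post:

```python
# Back-of-the-envelope only; the per-GPU price is assumed, not from the post.
gpus = 4092
usd_per_h100 = 25_000  # rough low-end street price at the time
print(f"~${gpus * usd_per_h100 / 1e6:.0f}M in GPUs alone")  # ~$102M
```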
renewiltord | 11 months ago
This is hella cool. Cisco has a new NVIDIA collab with 800G per port. I don't recall if it was RoCE or not. The InfiniBand is accessible by the GPUs here? Beautiful.

Thank you for sharing all this. One of the more directly useful posts.
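For anyone wondering what "the InfiniBand is accessible by the GPUs" means in practice: with GPUDirect RDMA the HCA reads and writes GPU memory directly, without a bounce through host RAM. Here is a minimal sketch of the NCCL knobs involved; the environment variables are real NCCL settings, but the values are assumptions for a typical mlx5 fabric:

```python
# Sketch: point NCCL at the InfiniBand HCAs and allow GPUDirect RDMA.
# Values are assumed for a typical Mellanox/NVIDIA (mlx5) fabric.
import os

os.environ["NCCL_IB_HCA"] = "mlx5"        # select the mlx5 InfiniBand adapters
os.environ["NCCL_NET_GDR_LEVEL"] = "SYS"  # permit GPUDirect RDMA system-wide
os.environ["NCCL_DEBUG"] = "INFO"         # log whether NET/IB + GDRDMA was chosen
# ...then initialize torch.distributed / NCCL as usual and check the log lines.
```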
loudmax | 11 months ago
This was discussed on the Latent Space podcast a few days ago: https://www.latent.space/p/llm-training-2024

That was a good episode, worth listening to for the justifications behind some of these decisions.
lifeisstillgood | 11 months ago
I am fascinated by the total electrical power drawn to build these models - power and cooling, I guess. Do you have any numbers on that? (The point being that Zuckerberg suggested in a podcast that the next 1GW model was being planned - basically a data centre with a mid-sized power plant attached.)
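For a rough sense of scale, a sketch of the arithmetic for this cluster; the wattage is the H100 SXM board power and the PUE is an assumption, neither is a number from the post:

```python
# Rough estimate; 700 W is the H100 SXM board power, PUE is assumed.
gpus = 4092
gpu_watts = 700
pue = 1.3  # assumed overhead for cooling, networking, etc.
megawatts = gpus * gpu_watts * pue / 1e6
print(f"~{megawatts:.1f} MW for the GPUs alone")  # ~3.7 MW; whole nodes draw more
```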
omerhac | 11 months ago
This is such a valuable piece. I've learned so much reading it! And your open-source code is great as well.

Some open questions I have:

1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines / switches?

2) What were your considerations in choosing the cluster architecture that have proven the most valuable (apart from the all2all comms)?

3) Can you share a bit more about your logging infra, apart from the fact that it was Loki-based?

4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime?

Thanks!
mmastrac | 11 months ago
Honest question: why is there so much PC hardware in the mix here? Why don't we have PCI + InfiniBand backends with GPUs and a little tiny orchestrating ARM controller and just let them all coordinate with each other? Is it just "momentum" from previous designs and/or lack of "market" for specialized GPU controllers?
instagib | 11 months ago
4,092 H100 GPUs.

They're working on "self-coding". No-code or minimal-code solutions, or?

There are also quite a few articles and such on their website that people may be interested in: https://imbue.com/our-work/
weinzierl | 11 months ago
How much did it cost? Overall, from nothing to the usable model files, in hardware cost, development hours and ultimately electricity and cooling?
wkat4242 | 11 months ago
I wonder if it's possible for a huge number of hobbyists to team up and train a model together in a distributed manner, like seti@home or folding@home. Or does this kind of workload not really lend itself to that approach?

Those things were of course characterised by the ability to spread the work into pretty self-contained work packages. Not sure if that can be done with model training.
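A back-of-the-envelope suggests why synchronous data parallelism does not survive home uplinks; every figure below is an assumption for illustration:

```python
# Each optimizer step must exchange the full gradient; home uplinks can't keep up.
params = 70e9
bytes_per_grad = 2                       # bf16 gradients
step_gb = params * bytes_per_grad / 1e9  # ~140 GB moved per step
uplink_gbps = 0.02                       # ~20 Mbit/s residential upload
hours = step_gb * 8 / uplink_gbps / 3600
print(f"{step_gb:.0f} GB/step -> ~{hours:.0f} h per step over a home uplink")
```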
john2x | 11 months ago
Once the model is trained, what happens to the hardware and infrastructure?
mikewarot | 11 months ago
It would be quite interesting to see the same hardware used to repeat the training, but with raw Unicode instead of tokenized training data.

I'd like to see the difference in performance on spelling and rhymes.
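A quick sketch of the main cost of dropping the tokenizer: the same text becomes a several-times-longer sequence. It assumes the tiktoken package and a GPT-style vocabulary; the counts in the comments are approximate:

```python
# Byte-level ("raw Unicode") modeling vs. BPE tokens for the same text.
import tiktoken

text = "Infrastructure setup and open-source scripts to train a 70B model"
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(text)))      # ~12 tokens
print(len(text.encode("utf-8")))  # ~66 bytes, i.e. ~5x longer sequences
```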