In the span of a few months, with a small team of researchers and engineers, we trained a 70B-parameter model from scratch on our own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks. Using our cluster for high-performance training meant that every component (InfiniBand, Ethernet, GPUs, and the nodes themselves) had to work perfectly. If even a single one of the over 12,000 connections was a little flaky, it could slow down the entire training run.

We're sharing open-source scripts and an end-to-end infrastructure setup guide that detail the process of making everything work perfectly, and ensuring that it stays that way.

This is one part of a three-part toolkit on training a 70B model from scratch. The other two parts focus on evaluations and on CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/

Thoughts and questions welcome! :)
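To give a flavor of what "checking every connection" means in practice, here is a minimal sketch (an illustration only, not the actual scripts from the repo) that flags InfiniBand ports on one host that are down or negotiated below the expected link rate; the 400 Gb/s expected rate is an assumption you would tune to your own fabric.

```python
# Illustrative sketch only, not the scripts from the post: flag InfiniBand
# ports on one host that are down or negotiated below the expected link rate.
import re
import subprocess

EXPECTED_RATE_GBPS = 400  # assumption; set to whatever your fabric should run at

def check_ib_ports() -> list:
    """Parse `ibstat` output and return human-readable problem descriptions."""
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    problems = []
    # ibstat prints one block per port; split on the "Port N:" headers.
    for block in re.split(r"Port \d+:", out)[1:]:
        state = re.search(r"State:\s*(\w+)", block)
        phys = re.search(r"Physical state:\s*(\w+)", block)
        rate = re.search(r"Rate:\s*(\d+)", block)
        if state and state.group(1) != "Active":
            problems.append(f"port not Active (state={state.group(1)})")
        if phys and phys.group(1) != "LinkUp":
            problems.append(f"link not up (physical state={phys.group(1)})")
        if rate and int(rate.group(1)) < EXPECTED_RATE_GBPS:
            problems.append(f"degraded link rate: {rate.group(1)} Gb/s")
    return problems

if __name__ == "__main__":
    for problem in check_ib_ports():
        print("WARN:", problem)
```

A link can also be "up" but slow, so checks like this are only one layer; in practice you would pair them with end-to-end all-reduce benchmarks across nodes.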
> This post focuses on one cluster that had 4,092 H100 GPUs spread across 511 computers, with eight GPUs to a computer

Am I right in understanding that that's over $100 million worth of GPUs?

I wonder when, if ever, any of this will be within the reach of an enthusiast on a gaming-PC budget.
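Back-of-envelope (the per-GPU price is an assumption; H100 SXM boards have been quoted at roughly $25k to $40k each):

```python
# Back-of-envelope only; per-GPU prices are assumptions, not quoted figures.
num_gpus = 4_092
price_low_usd, price_high_usd = 25_000, 40_000  # assumed price range per H100
low = num_gpus * price_low_usd / 1e6
high = num_gpus * price_high_usd / 1e6
print(f"~${low:.0f}M to ~${high:.0f}M for the GPUs alone")  # ~$102M to ~$164M
```

And that is before networking, storage, host machines, and the facility itself.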
This is hella cool. Cisco has a new NVIDIA collab with 800G per port. I don't recall if it was RoCE or not. The InfiniBand is accessible by the GPUs here? Beautiful.

Thank you for sharing all this. One of the more directly useful posts.
This was discussed on the Latent Space podcast a few days ago: https://www.latent.space/p/llm-training-2024

That was a good episode, worth a listen to hear the justifications behind some of these decisions.
I am fascinated by the total electrical power drawn to build these models (power and cooling, I guess). Do you have any numbers on that? For context, Zuckerberg suggested on a podcast that the next 1 GW model was being planned; basically a data centre with a mid-sized power plant attached.
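For a rough lower bound, just from GPU board power (the ~700 W TDP and the PUE figure below are assumptions):

```python
# Rough estimate only; TDP and PUE are assumptions, and host CPUs, networking,
# and storage are not counted separately here.
num_gpus = 4_092
gpu_tdp_w = 700     # assumed H100 SXM board power
pue = 1.3           # assumed facility overhead (cooling, power conversion)
gpu_mw = num_gpus * gpu_tdp_w / 1e6
print(f"GPUs alone: ~{gpu_mw:.1f} MW; facility: plausibly ~{gpu_mw * pue:.1f}+ MW")
```

So a cluster of this size is in the single-digit-megawatt range, still two to three orders of magnitude below the 1 GW figure.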
This is such a valuable piece.
I've learned so much reading it! And your open-source code is great as well.

Some open questions I have:
1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines/switches?
2) Which considerations in choosing the cluster architecture have proven the most valuable (apart from the all2all comms)?
3) Can you share a bit more about your logging infra, beyond the fact that it was Loki-based?
4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime?

Thanks!
Honest question: why is there so much PC hardware in the mix here? Why don't we have PCI + InfiniBand backends with GPUs and a tiny orchestrating ARM controller, and just let them all coordinate with each other? Is it just "momentum" from previous designs and/or a lack of "market" for specialized GPU controllers?
4,092 H100 GPUs.

They're working on "self-coding".
No-code or minimal-code solutions, or something else?

There are also quite a few articles and other resources people may be interested in on their website:
<a href="https://imbue.com/our-work/" rel="nofollow">https://imbue.com/our-work/</a>
I wonder if it's possible for a huge number of hobbyists to team up and train a model together in a distributed manner, like SETI@home or Folding@home. Or does this kind of workload not really lend itself to that approach?

Those projects were of course characterised by the ability to split the work into pretty self-contained work packages. Not sure if that can be done with model training.
It would be quite interesting to see the same hardware used to repeat the training, but with raw Unicode instead of tokenized training data.

I'd like to see the difference in performance on spelling and rhymes.
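To make the spelling point concrete, a tiny sketch; the "BPE-style" split below is made up for illustration and is not the output of any particular tokenizer:

```python
# Illustration only: the token split below is invented to show the idea,
# not produced by a real tokenizer.
word = "strawberry"

byte_level = list(word.encode("utf-8"))
print(byte_level)   # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121] - one unit per letter

bpe_like = ["straw", "berry"]
print(bpe_like)     # the model never sees the individual letters inside each chunk
```

The trade-off is sequence length: byte-level input makes the same text several times longer, which is a big part of why tokenizers are used in the first place.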