
Show HN: How-to guide on training Llama-405B using PyTorch distributed APIs

3 points by lambda-research 7 months ago
Been working on a guide for ML folks to upgrade their single-GPU training code to multi-GPU and multi-node. Code diffs and explanations are included.

The guide builds up to this final chapter (linked) on how to train a very large model like Llama 3.1 405B on a big cluster with plain PyTorch.

Everything is written using the direct PyTorch APIs (other than the model code, which just uses `transformers` models).

If there are topics of interest, feel free to open an issue in the repo; contributions are welcome.

I'm investigating adding a chapter on tensor parallelism, but its support in PyTorch is still in its early stages.
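As a rough illustration of the pattern described above (one process per GPU, launched with `torchrun`, using only plain PyTorch distributed APIs), here is a minimal sketch. This is not the guide's actual code: the toy two-layer model stands in for a `transformers` model, and FSDP is assumed as the sharding wrapper, since a 405B-parameter model cannot fit on a single GPU.

```python
# Minimal multi-GPU training sketch using plain PyTorch distributed APIs.
# Assumption: launched via `torchrun --nproc_per_node=<gpus> train.py`.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(  # placeholder for a large transformers model
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    ).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # which is what makes models too large for one GPU trainable at all.
    model = FSDP(model)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device="cuda")  # stand-in for a real batch
        loss = model(x).square().mean()          # stand-in for a real loss
        opt.zero_grad()
        loss.backward()  # gradient reduce-scatter happens during backward
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same launch command scales from one node to many: `torchrun` only needs the rendezvous endpoint and per-node GPU count, and the training loop itself is unchanged.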

2 comments

lostmsu 7 months ago
The guide does not say how efficient this run was in terms of GPU utilization (achieved TOPS / theoretical max TOPS).
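For reference, the ratio the comment asks about is usually reported as model FLOPs utilization (MFU): achieved training FLOPs per second divided by the cluster's theoretical peak. A back-of-envelope sketch, where every input is an illustrative assumption rather than a figure from the guide's run:

```python
# Back-of-envelope MFU (model FLOPs utilization) calculation.
# All inputs are illustrative assumptions, not measurements from the guide.
n_params = 405e9         # Llama 3.1 405B parameter count
tokens_per_sec = 20_000  # assumed measured cluster throughput
n_gpus = 128             # assumed cluster size
peak_flops = 989e12      # H100 dense BF16 peak, FLOPs/s per GPU

# Training costs roughly 6 * N FLOPs per token (forward + backward).
achieved_flops = 6 * n_params * tokens_per_sec
mfu = achieved_flops / (n_gpus * peak_flops)
print(f"MFU = {mfu:.1%}")  # about 38% with these assumed numbers
```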
lambda-research 7 months ago
Let me know if there are any questions or suggestions!

Feel free to open an issue on GitHub; contributions are also welcome.