
Notes on training BERT from scratch on an 8GB consumer GPU

171 points by montebicyclelo, almost 2 years ago

11 comments

dwrodri, almost 2 years ago
Super fascinating post. For those who go straight to the comments: someone managed to train a BERT model to 90% of the GLUE score reported in the original BERT paper, on a single GPU, in ~100 hours. Note that this includes pre-training!

I can't find a clear source on the time and compute used for the original BERT pretraining run, but it's clear that this is at least two orders of magnitude less hardware at roughly similar wall time.

I wonder how much of this could be translated over to the pretraining phase of a GPT?

I wonder if the SOPHIA[1] optimizer would also help here?

I'd argue that the research work being done to push these ML models into the realm of practicality on smaller hardware is just as important as the foundations it relies on.

1: https://arxiv.org/pdf/2305.14342.pdf
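For context, the core of the Sophia update is an element-wise clipped, Newton-like step: an exponential moving average of the gradient is preconditioned by an estimate of the Hessian diagonal. Below is a minimal sketch of just that update, assuming the moving averages `m` (gradient) and `h` (Hessian diagonal) are maintained elsewhere; the hyperparameter names follow the paper only loosely, and this is not the authors' reference implementation.

```python
import torch

@torch.no_grad()
def sophia_update(param: torch.Tensor, m: torch.Tensor, h: torch.Tensor,
                  lr: float = 1e-4, gamma: float = 0.01,
                  eps: float = 1e-12, weight_decay: float = 0.1) -> None:
    """Apply one Sophia-style parameter update in place (sketch)."""
    # Decoupled weight decay, as in AdamW.
    param.mul_(1.0 - lr * weight_decay)
    # Precondition the EMA gradient by the Hessian-diagonal estimate,
    # then clip element-wise to [-1, 1] to bound the per-coordinate step.
    step = (m / torch.clamp(gamma * h, min=eps)).clamp_(-1.0, 1.0)
    param.add_(step, alpha=-lr)
```

The element-wise clip is what keeps the step bounded where the Hessian estimate is small or noisy.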
GaggiX, almost 2 years ago
https://www.mosaicml.com/blog/mosaicbert

I think this is a useful article for people who want to train BERT from scratch for $20 (in this case by renting GPUs).

This model also has an actually good GLUE score.
nologic01, almost 2 years ago
Conceptually there should be a predictable tradeoff curve between memory size and execution time. This could be quite useful (e.g. if 100 hrs is OK, so are 200 hrs); after all, you don't train such a model every day.

But in practice this curve is probably very non-linear at the low end. This writeup shows nicely how lumpy the various steps are (loading kernels, the model, etc.):

https://huggingface.co/docs/transformers/perf_train_gpu_one
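The linked guide's main lever for this memory-for-time trade is gradient accumulation, optionally combined with activation checkpointing. A minimal sketch using Hugging Face's `TrainingArguments`; the specific numbers are illustrative, not taken from the post:

```python
from transformers import TrainingArguments

# Halving the micro-batch while doubling accumulation keeps the effective
# batch size constant (here 4 * 8 = 32) but lowers peak activation memory,
# at the cost of more wall time per optimizer step.
args = TrainingArguments(
    output_dir="bert-from-scratch",
    per_device_train_batch_size=4,   # small micro-batch that fits in 8 GB
    gradient_accumulation_steps=8,   # effective batch size of 32
    gradient_checkpointing=True,     # recompute activations: less memory, slower
    fp16=True,                       # half-precision activations and gradients
)
```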
Havoc, almost 2 years ago
Nice to see movement in the 8GB space. Not as sexy as the bigger stuff, but it still matters. As a reminder, these are the Steam hardware survey stats:

- 6 GB: 19%
- 8 GB: 28%
- 12 GB: 11%
- the rest: mostly <6 GB

So staying below 8 GB helps open things up to a lot of tinkerers, in a sort of democratic sense.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam
d4rkp4ttern, almost 2 years ago
Impressive feat. Honest question though: what are the reasons to pay attention to BERT when there is GPT-4 for various use cases? Does it come down to cost, latency, and privacy (vs. using the OpenAI API)?

Genuinely curious whether I am missing a compelling use case outside of those reasons.
curiousgal, almost 2 years ago
I am sure a lot of work (and time) went into this, but what's the point? If the model is not as good as the original, then the applications are limited and the training time is moot.
bcatanzaro, almost 2 years ago
Does anyone know whether this training run used FP16 or BF16? This GPU has tensor cores, which dramatically accelerate DL if they are being used. The post doesn't mention it.
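Either precision is typically enabled through PyTorch's autocast: BF16 needs no loss scaling, while FP16 is usually paired with a GradScaler to avoid gradient underflow. A minimal sketch of a training step covering both paths; `model`, `loader`, and `optimizer` are placeholders, and this is not necessarily what the post's author did:

```python
import torch

use_bf16 = torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
# GradScaler is a no-op when disabled, so one loop handles both precisions.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)

for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    # Matmuls inside autocast run in half precision on the tensor cores.
    with torch.autocast(device_type="cuda", dtype=dtype):
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```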
sireat, almost 2 years ago
This is impressive indeed.

I trained BERT from scratch on Colab's K80s, spending a few hours (not 100 h as in the article) on a much smaller corpus.

My results were, understandably, rather horrible.
zakki, almost 2 years ago
If we do the training from scratch, what happens when the power goes down in the middle of training? Does the training have to be restarted from the beginning?
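It doesn't have to be, as long as the run checkpoints periodically. A minimal sketch of save/resume with plain PyTorch; the names (`model`, `optimizer`, `step`) are placeholders rather than anything from the post:

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    # Persist everything needed to resume: weights, optimizer state, position.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step
```

With this in place, a power failure only costs the steps since the last save, not the whole run.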
ChuckNorris89, almost 2 years ago
Should specify an Nvidia GPU. AMD is notoriously absent from all this ML progress.
2-718-281-828, almost 2 years ago
I've always been curious about the statistics of comments/points on HN posts. This one has been up 10 h, has 57 points, and not a single comment so far. Fascinating.