Conceptually there should be a predictable tradeoff curve between memory size and execution time. This could be quite useful (e.g. if 100 hrs is OK, so is 200 hrs); after all, you don't train such a model every day.

In practice, though, this curve is probably very non-linear at the low end. This writeup shows nicely how lumpy the various steps (loading kernels, the model, etc.) are:

https://huggingface.co/docs/transformers/perf_train_gpu_one
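For what it's worth, the two main memory-for-time knobs that doc covers (gradient accumulation and gradient checkpointing) are each a one-liner in the Trainer API. A minimal sketch, assuming the transformers library; the batch size and step counts are illustrative, not tuned:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="out",  # placeholder path
        # Shrink the per-device batch and accumulate gradients instead:
        # same effective batch size (1 x 16), less activation memory,
        # at the cost of more forward/backward passes per optimizer step.
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        # Recompute activations during the backward pass rather than
        # storing them; the linked doc cites roughly a 20% slowdown in
        # exchange for a large cut in activation memory.
        gradient_checkpointing=True,
    )

That's also part of why the low end is lumpy: these options aren't a dial, they're discrete switches with step-function costs.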