I did a quick comparison on MNIST with a small ConvNet, pitting this AdamWScheduleFree optimizer against a few other optimizers (RAdam, NAdam, AdamW, SGD, Adam, Adafactor, SophiaG). The validation accuracy seems to be okay and the train loss decreases remarkably quickly.

Validation accuracy: https://i.imgur.com/8ZtX7Rd.png

Train loss: https://i.imgur.com/o5XdQ29.png

Code: https://bpa.st/NVJQ (currently only runs on my machine; I haven't had time to clean it up)

Note that this is just a toy benchmark with very little hyperparameter tuning. You could probably get similar results with most optimizers and an appropriate schedule. Nevertheless, I appreciate every hyperparameter that I do not have to set manually.

In summary, this seems to be a promising optimizer. I'll add it to my list of optimizers to try for new deep learning projects.
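For reference, the core of what I ran looks roughly like the sketch below. It assumes the `schedulefree` package and its AdamWScheduleFree class; the architecture and learning rate here are illustrative placeholders rather than exactly what is in the linked script, which adds the other optimizers and logging.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import datasets, transforms
    import schedulefree  # pip install schedulefree

    model = nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(64 * 7 * 7, 10),
    )

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=128, shuffle=True)
    val_loader = torch.utils.data.DataLoader(
        datasets.MNIST("data", train=False, transform=transforms.ToTensor()),
        batch_size=512)

    # lr is a ballpark value, not a tuned one
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2e-3)

    for epoch in range(3):
        # The optimizer tracks two parameter sequences internally, so it has
        # to be switched between train and eval mode alongside the model.
        model.train()
        optimizer.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()

        model.eval()
        optimizer.eval()
        with torch.no_grad():
            correct = sum((model(xb).argmax(1) == yb).sum().item()
                          for xb, yb in val_loader)
        print(f"epoch {epoch}: val acc {correct / len(val_loader.dataset):.4f}")

The main practical difference from a vanilla AdamW loop is the extra optimizer.train()/optimizer.eval() switching.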
This is a pretty hyped-up optimizer that seems to have okay-ish performance in practice, but there are a number of major red flags here. For one, the baselines are decently sandbagged, yet the Twitter posts sharing them (which are pretty hype-y) directly say that the baselines are "highly tuned" and that there's no benchmark trickery, which is flat-out wrong. If someone has no experience with said benchmarks, that's a plausible statement; but having worked with some of these datasets very closely, I can say some of the baselines are simply terrible, and I don't know where they came from.

Additionally, the optimizer does actually appear to have a kind of momentum, despite claims directly to the contrary, and uses it with a nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' only because the schedule is hardcoded into the algorithm itself -- a 1./steps_taken averaging, which is not exactly a rare learning rate schedule. That schedule is decently robust but sometimes suboptimal, and I find it sketchy to claim the method is 'schedule-free'. This also cripples the optimizer by tying performance to the number of steps taken, which is potentially a problem if you are using any batch-size + LR scaling strategies, as I understand it.

There is a mixture of hype and substance here, and I wish the author were more straightforward with their approach and claims. I think there is the potential for a good "bolts-included" optimizer in some of the ideas being presented here, but the amount of overhyping and deception makes me not want to trust any of the work that follows.

Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at the very best deceptive, and at the very worst untrue. I could be wrong -- these are just my personal opinions from my own experience, but I do occasionally find myself distraught about the things that tend to catch wind in the technical news cycle.

-Fern
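To make the momentum / hardcoded-schedule point concrete, here is my own paraphrase of the update, written as plain schedule-free SGD and ignoring the Adam second-moment and weight-decay machinery. Variable names and ordering are mine, not the repo's, so take it as a sketch of the idea rather than the exact implementation.

    import numpy as np

    def schedule_free_sgd_step(x, z, grad_fn, lr, beta, t):
        # The gradient is evaluated at an interpolation between the averaged
        # iterate x and the base iterate z -- the momentum / nesterov-like
        # part, despite the "no momentum" framing.
        y = (1 - beta) * z + beta * x
        g = grad_fn(y)
        # Plain SGD step on the base sequence.
        z = z - lr * g
        # Running average with weight 1/steps_taken -- i.e. the "schedule"
        # is hardcoded into the averaging.
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z
        return x, z

    # Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
    x = z = np.ones(3)
    for t in range(100):
        x, z = schedule_free_sgd_step(x, z, lambda w: w, lr=0.1, beta=0.9, t=t)

The x sequence is what gets evaluated at test time, which is why the 1/t weighting acts as an implicit schedule tied to the total step count.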
And here I was hoping for something related to how to approach self-driven learning and education when you have a hectic and unpredictable schedule and are trying to fit learning in-between things with the fragments of time you have.
I was asking on Twitter if Aaron had any experiments for transformers, since they provided some graphs for CNNs and the like, but no transformers.

* Aaron et al.'s past work on D-Adaptation won a best paper award at ICML, with the follow-up work being Prodigy -- but on transformers both did similar to or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125

* Superconvergence + the LR range finder + fast.ai's Ranger21 optimizer was the go-to setup for CNNs and worked fabulously well, but on transformers the LR range finder said 1e-3 was best when 1e-5 actually worked better. The 1-cycle learning rate schedule did stick, however (a minimal sketch of that setup is below). https://github.com/huggingface/transformers/issues/16013

* A huge issue is that this still needs tuning?! But how about a well-tuned AdamW? E.g. see https://twitter.com/kellerjordan0/status/1776716388037529843, where a tuned SGD outperformed it.

* I'm just a little bit reserved for now, since the author themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise, https://parameterfree.com/2023/08/30/yet-another-icml-award-fiasco/ wasn't pleasant.
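For concreteness, the 1-cycle setup I mean is roughly the sketch below, using PyTorch's built-in OneCycleLR on a toy problem. The model, data, and max_lr are placeholders; in practice max_lr comes from an LR range finder sweep, and that is exactly the value that tends to be off for transformers.

    import torch

    torch.manual_seed(0)
    X, y = torch.randn(512, 20), torch.randint(0, 10, (512,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X, y), batch_size=64, shuffle=True)
    model = torch.nn.Linear(20, 10)

    epochs = 5
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=1e-3,  # the range finder's pick; in my experience this can be
                      # far too aggressive for transformers
        total_steps=len(loader) * epochs)

    for epoch in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            torch.nn.functional.cross_entropy(model(xb), yb).backward()
            optimizer.step()
            scheduler.step()  # 1-cycle: LR warms up, then anneals, every step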