Grokfast: Accelerated Grokking by Amplifying Slow Gradients

117 points by johnsutor 12 months ago

8 comments

svara 12 months ago
Grokking is certainly an interesting phenomenon, but have practical applications of it been discovered yet?

I remember seeing grokking demonstrated for MNIST (are there any other non-synthetic datasets for which it has been shown?), but the authors of that paper had to make the training set smaller and ended up with test accuracy far below state of the art.

I'm very interested in this research, just curious about how practically relevant it is (yet).
esafak 12 months ago
I missed the beginning of the story. Why and when does grokking occur? It seems to be a case of reaching a new basin, which would cast doubt on the shallow-basin hypothesis for over-parameterized neural networks. Last I checked, all the extrema in such models were supposed to be good and easy to reach?
thesz 12 months ago
GrokFast strongly reminds me of Stochastic Average Gradient Descent: https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.pdf

Both use averaging.
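For readers unfamiliar with the method, here is a minimal sketch of the averaging being compared, assuming the EMA formulation described in the paper: keep a running average of each parameter's gradient and add a scaled copy of it back onto the raw gradient before the optimizer step. The function name and default hyperparameters below are illustrative, not the authors' code.

    import torch

    def grokfast_ema_filter(model, ema, alpha=0.98, lamb=2.0):
        """Amplify the slow (averaged) gradient component in place; returns updated EMA state."""
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            g = p.grad.detach()
            if name not in ema:
                ema[name] = g.clone()                            # initialize running average
            else:
                ema[name].mul_(alpha).add_(g, alpha=1 - alpha)   # mu <- alpha*mu + (1-alpha)*g
            p.grad.add_(ema[name], alpha=lamb)                   # g <- g + lamb*mu
        return ema

The filter would be called between loss.backward() and optimizer.step(), with ema starting as an empty dict. The SAG connection drawn above is that both methods keep running averages of past gradients, though SAG averages stored per-example gradients rather than low-pass filtering the mini-batch gradient.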
utensil4778 12 months ago
I'm really annoyed that AI types are just stealing well-established vocabulary from *everywhere* and assigning new, arbitrary definitions to it.

You have countless LLMs; use one of them to generate new names that don't require ten billion new disambiguation pages on Wikipedia.
aoeusnth1 12 months ago
How does this differ from momentum in practice? Gradient momentum already applies an exponential-decay average to the gradients. The authors discuss how their approach differs from momentum in its formula, but not how it differs in practice. Momentum, Adam, and the other optimizers that maintain gradient statistics have already explored this intellectual space, so I'm not sure why this paper exists unless it offers some practical advantage on top of existing practice.
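To make the contrast concrete, here is a toy side-by-side of the two update rules as described above; plain SGD is used for the filtered step purely for illustration, and all names and hyperparameter values are made up rather than taken from the paper.

    import torch

    def sgd_momentum_step(w, g, v, lr=0.01, beta=0.9):
        # Classical momentum: the running average *replaces* the gradient.
        #   v <- beta*v + g;  w <- w - lr*v
        v.mul_(beta).add_(g)
        w.sub_(v, alpha=lr)
        return w, v

    def filtered_gradient_step(w, g, mu, lr=0.01, alpha=0.98, lamb=2.0):
        # Grokfast-style filtering: the running average is *added on top of*
        # the raw gradient, and the result is handed to an unmodified optimizer
        # (a standard optimizer such as Adam in the paper; plain SGD here for brevity).
        #   mu <- alpha*mu + (1-alpha)*g;  g_hat <- g + lamb*mu
        mu.mul_(alpha).add_(g, alpha=1 - alpha)
        g_hat = g + lamb * mu
        w.sub_(g_hat, alpha=lr)
        return w, mu

Read this way, the practical difference is mostly one of composition: momentum (or Adam's first moment) lives inside the optimizer and replaces the raw gradient in the update, while the filter rewrites the gradient before the optimizer sees it, so the optimizer's own momentum and normalization are then applied on top of the amplified slow component.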
eigenvalue 12 months ago
I have a suspicion that this technique will prove most valuable for market-oriented datasets (like price-related time series), where the data is nowhere near the scale of text corpora, and where there are very tight limits on how much training data you can use because you only want recent data, to reduce exposure to market regime changes. This approach seems to shine when you don't quite have enough training data to fully map out the general case, but can get lucky and fall into it if you train naively for long enough.
curious_cat_163 12 months ago
Cute! The signal processing folks have entered the room... :)
buildbot 12 months ago
Why only MNIST and a graph CNN? Those are small and somewhat odd choices. Scale these days should be at least 100-million-parameter models on something like OpenWebText, in my opinion. Not sure what the SoTA is for vision, but the same argument applies there.