Grokking is certainly an interesting phenomenon, but have practical applications of it been discovered yet?<p>I remember seeing grokking demonstrated for MNIST (are there any other non-synthetic datasets for which it has been shown?), but the authors of that paper had to shrink the training set and ended up with test accuracy far below state of the art.<p>I'm very interested in this research, just curious about how practically relevant it is (yet).
I missed the beginning of the story. Why and when does grokking occur? It seems to be a case of reaching a new basin, which would cast doubt on the shallow-basin picture of over-parameterized neural networks? Last I checked, all the minima in such models were supposed to be good and easy to reach?
GrokFast strongly reminds me of Stochastic Average Gradient (SAG): <a href="https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.pdf" rel="nofollow">https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.pdf</a><p>Both maintain an average of past gradients instead of using only the current one.
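For anyone who hasn't seen SAG: it stores one gradient per training example and steps with the running average of those stored gradients, which is why the parallel to a low-pass-filtered gradient jumps out. A minimal sketch on a toy least-squares problem (the problem, step size, and variable names are mine, not from the slides):

```python
import numpy as np

# Toy least-squares problem: minimize mean_i (x_i @ w - y_i)^2
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
stored = np.zeros((n, d))   # one remembered gradient per example
avg = np.zeros(d)           # running average of the stored gradients
lr = 1.0 / (2 * np.max((X ** 2).sum(axis=1)))   # conservative step size

for step in range(5000):
    i = rng.integers(n)
    g_i = 2 * (X[i] @ w - y[i]) * X[i]   # fresh gradient for example i only
    avg += (g_i - stored[i]) / n         # update the average in O(d)
    stored[i] = g_i
    w -= lr * avg                        # step with the *averaged* gradient
```

The averaging in SAG is across examples, while GrokFast's filter averages across time steps, but in both cases the update direction is dominated by an accumulated, smoothed gradient rather than the latest noisy one.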
I'm really annoyed that AI types are just stealing well-established terms from <i>everywhere</i> and assigning new, arbitrary definitions to them.<p>You have countless LLMs; use one of them to generate new names that don't require ten billion new disambiguation pages on Wikipedia.
How does this differ from momentum in practice? Gradient momentum already applies an exponentially decaying average to the gradients. The authors discuss how their approach differs from momentum in its formula, but not how it differs in practice. Momentum, Adam, and the other optimizers that accumulate gradient statistics have already explored this space, so I'm not sure what this paper adds unless it has some practical advantage over existing practice.
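To make the contrast concrete, here is a minimal sketch (not the authors' code) of a Grokfast-EMA-style filter as the paper describes it: an EMA of the gradient is kept and added back, scaled by a factor, on top of the raw gradient before an otherwise unmodified optimizer steps. The toy model, the hyperparameter values, and the names `alpha`/`lam` are illustrative assumptions:

```python
import torch

# Toy model and data just to have real gradients to filter.
torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 16), torch.randn(64, 1)

ema = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
alpha, lam = 0.98, 2.0   # illustrative values, not tuned

for step in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Momentum would *replace* the update direction with an EMA of gradients.
    # The Grokfast-style filter instead keeps the raw gradient and *adds* the
    # EMA (the slow, low-frequency component) back on top of it, amplified by
    # lam, before the optimizer takes its usual step.
    with torch.no_grad():
        for n, p in model.named_parameters():
            ema[n].mul_(alpha).add_(p.grad, alpha=1 - alpha)
            p.grad.add_(ema[n], alpha=lam)

    opt.step()
```

So the practical difference, at least on paper, is that the slow component is amplified on top of the raw gradient rather than substituted for it, and the filter composes with whatever optimizer (SGD, Adam, ...) you were already using. Whether that behaves meaningfully differently from just retuning momentum is exactly the question above.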
I have a suspicion that this technique will prove most valuable for market-oriented datasets (like price-related time series), where the data scale is nowhere near that of text corpora, and where there are very tight limits on how much training data you can use because you only want to include recent data to reduce the chance of spanning a market regime change. This approach seems to shine when you don't quite have enough training data to pin down the general solution, but training long enough naively can still get you lucky and let you fall into it.
Why only MNIST and a graph CNN? Those are small and somewhat odd choices. In my opinion, scale these days should mean at least 100-million-parameter models and something like OpenWebText as a dataset. Not sure what the SoTA is for vision, but the same argument applies there.