I sort of wish we would move on from the "grokking" terminology as the field generally uses it (a magical kind of generalization that may or may not suddenly happen if you train for a really long time).<p>I regard grokking as a failure mode in a lot of cases -- it's often not really a good thing. It tends to indicate that the combination of your network, task, and data is poorly suited to learning {XYZ} thing. There are emergent traits that I think a network can learn in a healthy manner over training, and I think those tend to fall under the 'generalization' umbrella.<p>Though, in terms of generalization, I'd strongly prefer to call it 'transitive' rather than 'compositional': transitive is the formal term most disciplines use for such things, and compositional has a different, more general meaning entirely. Similarly, I'd replace 'parametric' and 'non-parametric' with 'internal' and 'external', etc. Slogging through the definition salad of words (this paper alone takes up roughly half of the top Kagi hits for 'parametric memory') makes actually interpreting an argument more difficult.<p>One reinterpretation of the problem is -- of course external-memory models will have trouble generalizing to certain things the way models relying on internal memory do! This is because, in part, models with internal memory have much more 'experience' integrating the examples they've seen, whereas for an external-memory model like a typical RAG setup, anything is possible.<p>But, that being said, I don't think you can necessarily isolate that to the type of memory the model has alone, i.e., I don't think you can clearly say, even in a direct comparison between the two motifs, that it's the kind of memory itself (internal vs. external) that is to blame for this.
I think assuming so might end up leading down some unfruitful research paths.<p>That said, one positive about this paper is that they seem to have found a general circuit that forms for their task, and they analyze it. I believe that has value, but (and I know I tend to be harsh on papers generally) the rest of the paper seems to be more of a distraction.<p>Definitional salad buffets and speculation about the 'in' topics are going to be the things that make the headlines, but to make real progress, focusing on the fundamentals is really what's necessary here, I think. They may seem 'boring' a lot of the time, but they've certainly helped me quite a bit in my research. <3 :'))))