Grokked Transformers Are Implicit Reasoners

239 points by jasondavies, 12 months ago

11 comments

scarmig, 12 months ago
Since I first learned about grokking, I've had a strong suspicion that getting a handle on it and figuring out how to aid it should be the central question in AI. We are currently stuck in a local minimum, where memorizing circuits perform well enough to handle a whole lot of economically viable use cases. But the profit function has guided us into a valley dominated by a data- and compute-hungry architecture that isn't ideal for learning generalizing circuits (partially because the memorizing circuits are so effective! We relatively quickly get to a flat loss landscape, after which we blindly jump around for countless epochs in a kind of Brownian motion until we get into an area where regularizers can drive generalization). Research like this paper is incredibly important.

I thought this was the most interesting bit from the paper:

> Training data distribution, instead of training data size, qualitatively influences generalization behavior.
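For anyone who hasn't watched this happen, here is a minimal sketch of the classic grokking setup (my own toy illustration, not the paper's code): a tiny transformer trained on modular addition with a 50% train/validation split and heavy weight decay. Training accuracy saturates quickly, while validation accuracy sits near chance for a long stretch and then jumps, which is the delayed generalization described above. The model size, split fraction, and optimizer settings here are arbitrary choices.

    import torch
    import torch.nn as nn

    P = 97  # modulus; the vocabulary is the residues 0..P-1
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # All (a, b) pairs with label (a + b) mod P, split 50/50 into train/validation.
    pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
    labels = (pairs[:, 0] + pairs[:, 1]) % P
    perm = torch.randperm(len(pairs))
    train_idx, val_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

    class TinyTransformer(nn.Module):
        def __init__(self, vocab=P, d=128, heads=4, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab, d)
            self.pos = nn.Parameter(torch.zeros(2, d))  # learned positions for (a, b)
            block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, layers)
            self.head = nn.Linear(d, vocab)

        def forward(self, x):              # x: (batch, 2) integer tokens
            h = self.encoder(self.embed(x) + self.pos)
            return self.head(h[:, -1])     # predict (a + b) mod P from the last position

    model = TinyTransformer().to(device)
    # Heavy weight decay is the regularizer usually credited with eventually pushing
    # the network from the memorizing solution over to the generalizing one.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100_000):
        batch = train_idx[torch.randint(len(train_idx), (512,))]
        x, y = pairs[batch].to(device), labels[batch].to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if step % 1000 == 0:
            with torch.no_grad():
                xv, yv = pairs[val_idx].to(device), labels[val_idx].to(device)
                val_acc = (model(xv).argmax(-1) == yv).float().mean().item()
            # Train loss flattens early; note how much later val_acc finally moves.
            print(f"step {step}: train loss {loss.item():.4f}, val acc {val_acc:.3f}")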
taneq, 12 months ago
Reminds me of that old quote about "the difference between average and state of the art is forgetting to turn it off over summer break", or similar.

I wonder if this is why smaller LLMs seem to punch above their weight: are they further along in the process of distilling the data down into understanding?
nico, 12 months ago
Conceptually, grokking reminds me of the idea presented in the book The Dip.

Most people, for most tasks, will only learn/train/try to improve up to the point where they hit a flat or negative return curve per unit of effort put in.

But the people who are the best at a given task are usually the ones who got through The Dip in the curve of return per effort.
Scene_Cast2, 12 months ago
I just learned about grokking; it reminds me of double descent, so I looked up a 2022 paper called "Unifying Grokking and Double Descent". I'm still unclear on what the difference is. My basic understanding of double descent was that the regularization loss made the model focus on regularization after fitting the train data.
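One way to make that "focus on regularization after fitting the train data" intuition concrete (my own gloss, not from either paper): write the objective as a fit term plus weight decay. Once the fit term is essentially zero on the training set, the gradient is dominated by the regularizer, so continued training mostly shrinks and reorganizes the weights rather than improving the training fit.

    % Objective: data-fitting loss plus an L2 regularizer (weight decay)
    \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{fit}}(\theta) + \lambda \lVert \theta \rVert_2^2

    % Once the training data is fit, \nabla_\theta \mathcal{L}_{\mathrm{fit}}(\theta) \approx 0, so
    \nabla_\theta \mathcal{L}(\theta) \approx 2 \lambda \theta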
PoignardAzur, 12 months ago
This paper feels way too abstract, to the point that it's hard to understand what the team actually did.

For instance, the paper claims to beat GPT-4-Turbo and Gemini-Pro-1.5 on certain tasks... but it doesn't include any of the questions they asked GPT-4 or Gemini, so it's hard to guess whether these results have any value at all.

It's also unclear what they even trained their custom transformer to do. It has a custom tokenizer, but they don't give a list of tokens (aside from a few examples in the diagrams like "Barrack", "Michelle", "Trump"). They talk about in-distribution and out-of-distribution tasks, but they don't give any examples of these tasks or what they look like.

This feels like accidental complexity. It wouldn't have been hard to add a few more appendices with, e.g., a list of 20 or so in-distribution sentences they asked the model to complete and 10 out-of-distribution sentences. Instead, all they include are diagrams comparing performance for different hyperparameters and the like, but we don't even know *what the models are being tested on*.
tysam_and, 12 months ago
I sort of wish that we would move on from the "grokking" terminology in the way the field generally uses it (a magical kind of generalization that may-or-may-not-suddenly-happen if you train for a really long time).

I generally regard grokking as a failure mode in a lot of cases -- it's oftentimes not really a good thing. It tends to indicate that the combination of your network, task, and data is poorly suited to learning {XYZ} thing. There are emergent traits which I think the network can learn in a healthy manner over training, and I think that tends to fall under the 'generalization' umbrella.

Though I'd strongly prefer to call it 'transitive' rather than 'compositional' in terms of generalization, as 'transitive' is the formal term most disciplines use for such things, while 'compositional' has a different, more general meaning entirely. Similarly, I'd replace 'parametric' and 'non-parametric' with 'internal' and 'external', etc. Slogging through the definitional salad (this paper alone takes up roughly half of the top Kagi hits for 'parametric memory') makes actually interpreting an argument more difficult.

One reinterpretation of the problem is -- of course external-memory models will have trouble generalizing to certain things the way models relying on internal memory do! This is because, in part, models with internal memory will have much more 'experience' integrating the examples that they've seen, whereas, for an external-memory model like a typical RAG setup, anything is possible.

But, that being said, I don't think you can necessarily isolate that to the type of memory the model has alone, i.e., I don't think you can clearly say, even in a direct comparison between the two motifs, that it's the kind of memory itself (internal vs. external) that is to blame for this. I think that might end up leading down some unfruitful research paths if so.

That said, one positive about this paper is that they seem to have found a general circuit that forms for their task, and they analyze it. I believe that has value, but (and I know I tend to be harsh on papers generally) the rest of the paper seems to be more of a distraction.

Definitional salad buffets and speculation about the 'in' topics are going to be the things that make the headlines, but in order to make real progress, focusing on the fundamentals is really what's necessary here, I think. They may seem 'boring' a lot of the time, but they've certainly helped me quite a bit in my research. <3 :'))))
imtringued, 12 months ago
One of the biggest bottlenecks of multi-layer transformers is that reasoning can only happen in the hidden layers. Past the final layer, the model must generate a token that conforms to the training process. This token can then be fed back into the transformer from the beginning, but since it necessarily must be in natural language, it limits the type of reasoning the model can perform to the "thoughts" it has seen in the dataset and is therefore allowed to express. If you could figure out how to have the first layer's attention take into account both its own KV and the KV of the final layers, the model would become capable of infinite-length reasoning.
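A rough sketch of that idea (entirely hypothetical, not something the paper implements): cache the final layer's hidden states from the previous forward pass and let the first layer's attention see them alongside its own keys and values, so intermediate "thoughts" can circulate without first being collapsed into a natural-language token. The module and argument names below are made up for illustration.

    import torch
    import torch.nn as nn

    class FeedbackAttention(nn.Module):
        """First-layer attention that can also attend over hidden states
        cached from the final layer of the previous forward pass."""

        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, x, final_layer_state=None):
            # x: (batch, seq, d_model) embeddings entering the first layer.
            # final_layer_state: (batch, seq, d_model) hidden states cached from
            # the last layer of the previous pass (None on the very first pass).
            if final_layer_state is not None:
                # Keys/values come from both the current input and the fed-back
                # states, avoiding a round trip through discrete token space.
                kv = torch.cat([x, final_layer_state], dim=1)
            else:
                kv = x
            out, _ = self.attn(query=x, key=kv, value=kv, need_weights=False)
            return out

In use, you would run the stack once, cache the last layer's output, and pass it back in as final_layer_state on the next pass. Whether looping hidden states like this actually yields unbounded reasoning is the conjecture above, not an established result.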
vsaustraliabtc, 12 months ago
How do we actually implement this? I'm struggling to work out how I could use this rather than my stupid LangGraph / recursive RAG-checking crap that takes too much time and never really does it justice.
sturza, 12 months ago
Grokking is all you need?
campers, 12 months ago
This is the interesting result, where their GPT-2-sized transformer blows away GPT-4 and Gemini 1.5 at connecting facts together:

> The difficulty of such a task is two-fold. First, the search space is large. For example, on average, each query entity connects with more than 50 facts, and each bridge entity in the ground truth proof connects with more than 900 facts. Second, there are no surface form clues to exploit and bias the search towards the ground truth proof, unlike most conventional QA benchmarks where the proof steps are transparent from the query. To test LLMs based on non-parametric memory, we translate the facts into natural language by simple templates (Appendix F). Facts/queries for each attribute are grouped/tested separately. We test both the vanilla setup where all facts (28.2K on average) are loaded into the LLM context, and the retrieval-augmented setup (5.4K facts retrieved on average) where the two-hop neighborhoods of the two query entities are retrieved, which includes enough facts to deduce the answer. We also try both standard prompting where the model answers directly, and chain-of-thought (CoT) prompting where the model is prompted to verbalize the reasoning. We test GPT-4-Turbo and Gemini-Pro-1.5, where for GPT-4-Turbo we only test the retrieval-augmented setup due to context length limit.
>
> Table 1: Results on the complex reasoning task. Direct/CoT: predict the answer directly / verbalize the reasoning steps. "+R": retrieval augmentation.
>
>                   GPT-4-Turbo        Gemini-Pro-1.5                     Grokked Transformer
>                   Direct+R   CoT+R   Direct   CoT    Direct+R   CoT+R
>   Accuracy (%)    33.3       31.3    28.7     11.3   37.3       12.0    99.3
syntaxfree, 12 months ago
> delve