Gato, a Decision Transformer on steroids, is pretty much what you would expect, with the expected RL scaling curves†, if you've been following ML scaling research for the past 2 years. It is, however, still mindblowing to see it in reality.

And note that it's only as small (and thus, weak) as it is because they want to run it directly on robots ("We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters").

† https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf#page=11 looks just like any scaling curve from a text or vision paper...

Also submitted at https://news.ycombinator.com/item?id=31355657
This work has two really interesting contributions, in my opinion.

1. Creating a few data points (3) for scaling laws (Figure 8). These behave similarly to language models, as gwern puts it [1], but with only three data points it's a bit tough to draw a power-law conclusion (eyeballing the figure, they increase params 4.5x and 3.2x and see about a 20% relative performance improvement from each jump; see the toy fit sketched after the references below).

2. What I find more interesting than the scaling is the out-of-distribution (OOD) generalization results (Figure 9). They test the performance of the agent on a completely unseen task (though possibly from within the same domain, i.e., they might train on a fixed physics engine from the DeepMind Control Suite [2] but never let the agent look at the cartpole task). They compare this to various ablations: from-scratch training with the same architecture, pretraining only on same-domain data, and pretraining only on non-control data (presumably unsupervised, contrastive-learning-based data).

The results from (1) are impressive and those from (2) are mixed (but no less interesting as a contribution!) in terms of the additional training data actually helping with generalization. The OOD generalization performance is the most interesting result because it really tests whether control-based pretraining helps the agent in a truly new situation. And certainly, there are a couple of tasks on which zero-shot performance improves over the ablations (but there are others where it hurts).

What I'd find exciting to see in coming research is further investigation into variants of Figure 9.

- How does scaling affect the impact of control-data pretraining vs. non-control-data pretraining?

- The authors used a custom fine-tuning schedule for the few-shot evaluation on unseen tasks. It's possible the schedule also needs to be changed for the ablated versions of the agents to give them their best performance. What would Figure 9 look like with the "best" training setup for each ablation individually? I.e., can we tease apart how much, if at all, it's a matter of low-level modality-specific features helping zero-shot adaptation vs. some kind of truly generalized "control pretraining"?

[1] https://news.ycombinator.com/item?id=31356155
[2] https://arxiv.org/abs/1801.00690
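To make the power-law caveat from point (1) concrete, here's a minimal sketch (Python/NumPy, my own toy example, not anything from the paper) of fitting score ≈ a·N^b through three (parameters, score) points in log-log space. The parameter counts are Gato's three model sizes, roughly matching the 4.5x and 3.2x jumps mentioned above; the normalized scores are made-up placeholders chosen only to reflect the ~20% relative jumps eyeballed from Figure 8.

    import numpy as np

    # Gato's three model sizes; the scores are illustrative placeholders,
    # picked to show ~20% relative improvement per jump (not paper numbers).
    params = np.array([79e6, 364e6, 1.18e9])
    scores = np.array([0.45, 0.54, 0.65])

    # Fit score ~= a * params**b via linear regression in log-log space.
    b, log_a = np.polyfit(np.log(params), np.log(scores), deg=1)
    print(f"fitted exponent b ~= {b:.3f}, prefactor a ~= {np.exp(log_a):.3g}")

With two free parameters and only three points there's a single residual degree of freedom, so almost any smooth monotone curve would fit about as well. That's exactly why it's hard to call this a power law yet, even if the trend line looks familiar.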