This work has two really interesting contributions, in my opinion.

1. Creating a few data points (three of them) for scaling laws (Figure 8). These behave similarly to language models, as gwern puts it [1], but with only three points it's tough to draw a power-law conclusion: eyeballing the figure, each jump increases the parameter count by roughly 4.5x and 3.2x and yields about a 20% relative performance improvement (a rough back-of-envelope version of that eyeballing is sketched at the end of this comment).

2. What I find more interesting than the scaling is the out-of-distribution (OOD) generalization results (Figure 9). They test the agent on a completely unseen task, though possibly one from within a seen domain: e.g., they might train on a fixed physics engine from the DeepMind Control Suite [2] but never let the agent look at the cartpole task. They compare this against several ablations: training from scratch with the same architecture, pretraining only on same-domain data, and pretraining only on non-control data (presumably data from unsupervised contrastive learning).

The results from (1) are impressive and the results from (2) are mixed (but no less interesting as a contribution!) in terms of whether the additional training data actually helps with generalization. The OOD generalization performance is the most interesting measurement because it tests whether control-based pretraining helps the agent in a truly new situation. And there are indeed a couple of tasks where zero-shot performance improves over the ablations (but others where it hurts).

What I'd find exciting to see in coming research is further investigation into variants of Figure 9:

- How does scaling affect the impact of control-data pretraining vs. non-control-data pretraining?

- The authors used a custom fine-tuning schedule for the few-shot evaluation on unseen tasks. It's possible the schedule would also need to change for the ablated versions of the agents to get the best performance out of them. What would Figure 9 look like with the "best" training setup chosen for each ablation individually? I.e., can we tease apart how much, if at all, the gains come from low-level modality-specific features helping zero-shot adaptation vs. some kind of truly generalized "control pretraining"?

[1] https://news.ycombinator.com/item?id=31356155
[2] https://arxiv.org/abs/1801.00690
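
To make the eyeball estimate in (1) concrete, here's a minimal back-of-envelope sketch. It assumes performance scales as a pure power law in parameter count and plugs in my own rough readings of Figure 8 (roughly 4.5x and 3.2x parameter jumps, each worth about a 20% relative gain); none of these numbers come from the paper's tables.

    import math

    # My eyeballed readings of Figure 8, not the paper's reported numbers:
    # each jump multiplies the parameter count and gives ~20% relative gain.
    param_ratios = [4.5, 3.2]
    perf_ratios = [1.2, 1.2]

    # Under an assumed power law, perf ~ N^alpha, each consecutive pair of
    # models implies an exponent alpha = log(perf ratio) / log(param ratio).
    for n_ratio, p_ratio in zip(param_ratios, perf_ratios):
        alpha = math.log(p_ratio) / math.log(n_ratio)
        print(f"params x{n_ratio}: implied exponent ~ {alpha:.2f}")

    # Prints exponents of roughly 0.12 and 0.16 -- not wildly inconsistent,
    # but two intervals are nowhere near enough to pin down a power law.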