I'm not sure the folks putting out strong takes based on this have actually read the paper.

The paper uses a GPT-2-scale transformer trained on sinusoidal data, not natural language:

> We trained a decoder-only Transformer [7] model of GPT-2 scale implemented in the Jax based machine learning framework, Pax, with 12 layers, 8 attention heads, and a 256-dimensional embedding space (9.5M parameters) as our base configuration [4].

> Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of (x, f(x)) pairs rather than natural language.

(A rough sketch of what that controlled setting looks like is at the end of this comment.)

Nowhere near definitive or conclusive.

Not sure why this is news outside of the Twitter-techno-pseudo-academic-influencer bubble.
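
To make the quoted setup concrete, here is my own minimal JAX sketch of that kind of controlled setting: sampling random sinusoids and interleaving (x, f(x)) pairs into a single sequence for in-context learning. The function name, sampling ranges, and sequence length are my assumptions for illustration, not the paper's actual code or hyperparameters.

```python
# Illustrative sketch (not the paper's code): build training sequences of
# (x, f(x)) pairs from randomly sampled sinusoids f(x) = a * sin(b*x + c).
import jax
import jax.numpy as jnp

def sample_sinusoid_sequence(key, seq_len=32):
    """Sample one in-context sequence x_1, f(x_1), ..., x_n, f(x_n)."""
    k_params, k_x = jax.random.split(key)
    # Random amplitude, frequency, phase (ranges are assumed, not from the paper).
    a, b, c = jax.random.uniform(k_params, (3,), minval=0.5, maxval=2.0)
    xs = jax.random.uniform(k_x, (seq_len,), minval=-5.0, maxval=5.0)
    ys = a * jnp.sin(b * xs + c)
    # Interleave into a single stream of length 2 * seq_len: x_1, y_1, x_2, y_2, ...
    return jnp.stack([xs, ys], axis=1).reshape(-1)

keys = jax.random.split(jax.random.PRNGKey(0), 4)
batch = jax.vmap(sample_sinusoid_sequence)(keys)  # shape (4, 64)
print(batch.shape)
```

The point being: the model in question is a 9.5M-parameter decoder trained on synthetic streams like these, which is a long way from evidence about frontier LLMs trained on natural language.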