This seems to ignore the mixed record of video generation models:<p><pre><code> For visual reasoning practice, we can do supervised fine-tuning on sequences similar to the marble example above. For instance, to understand more about the physical world, we can show the model sequential pictures of Slinkys going down stairs, or basketball players shooting 3-pointers, or people hammering birdhouses together....
But where will we get all this training data? For spatial and physical reasoning tasks, we can leverage computer graphics to generate synthetic data. This approach is particularly valuable because simulations provide a controlled environment where we can create scenarios with known outcomes, making it easy to verify the model's predictions. But we'll also need real-world examples. Fortunately, there's an abundance of video content online that we can tap into. While initial datasets might require human annotation, soon models themselves will be able to process videos and their transcripts to extract training examples automatically.
</code></pre>
Almost every video generator makes constant "folk physics" errors and fails at object permanence. DeepMind's Veo2 is very impressive, but it still loses track of objects and produces qualitatively nonsensical physics: <a href="https://x.com/Norod78/status/1894438169061269750" rel="nofollow">https://x.com/Norod78/status/1894438169061269750</a><p>Humans do not learn these things by pure observation (newborns understand object permanence, and I suspect the same is true of all vertebrates). I doubt transformers can learn them as robustly, even if trained on all of YouTube. There will always be "out of distribution" physical nonsense involving mistakes humans (or lizards) would never make, even when they've never seen the specific objects involved.
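To be fair, the synthetic half of that proposal is easy to sketch. Here is a toy version of the "known outcomes" idea (my own illustration, not anything from the article): simulate a projectile, emit the sampled frames as a training sequence, and keep the closed-form landing point as a label a verifier can check predictions against. Generating this kind of data is the easy part; whether a transformer fine-tuned on it generalizes the way a newborn does is the part I doubt.<p><pre><code># Toy sketch of "synthetic data with a known outcome" -- my own illustration,
# not code from the article: simulate 2D projectile motion, emit the frame
# sequence as a training example, and keep the analytically known landing
# point as a label that a verifier can check answers against.
import math
import random

GRAVITY = 9.81  # m/s^2

def make_example(seed: int) -> dict:
    rng = random.Random(seed)
    speed = rng.uniform(5.0, 15.0)             # launch speed, m/s
    angle = math.radians(rng.uniform(20, 70))  # launch angle above horizontal
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)

    # Sample the trajectory at 30 fps until the ball hits the ground.
    dt, t, frames = 1.0 / 30.0, 0.0, []
    while True:
        x, y = vx * t, vy * t - 0.5 * GRAVITY * t * t
        if t > 0 and y < 0:
            break
        frames.append({"t": round(t, 3), "x": round(x, 3), "y": round(y, 3)})
        t += dt

    # The "known outcome": closed-form range, so checking a model's answer
    # does not depend on the same noisy frames it was shown.
    landing_x = speed ** 2 * math.sin(2 * angle) / GRAVITY
    return {
        "frames": frames,
        "question": "Where does the ball land?",
        "ground_truth_x": round(landing_x, 3),
    }

if __name__ == "__main__":
    ex = make_example(seed=0)
    print(len(ex["frames"]), "frames; ground-truth landing x =", ex["ground_truth_x"], "m")
</code></pre>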