People are obviously already pointing out the errors in various physical interactions shown in the demo videos, including the research team themselves, and I think the plausibility of the generated videos will likely improve as they work on the model more. However, I think the major reason this generation -> simulation leap might be harder than they think comes down to a plausibility/accuracy distinction. Generative models are general and versatile compared to predictive models, but they intrinsically learn an objective that assesses their extrapolations on spatial or sequential (or, in the case of video, both) plausibility, which has far more degrees of freedom than accuracy. In other words, the ability to produce reasonable-enough hypotheses for what the next frame or the next pixel over could be may end up not being enough.

The optimistic scenario is that it's possible to get to a simulation by narrowing this hypothesis space enough to accurately model reality. That is, accuracy might just fall out of plausibility being continuously improved: the subset of plausible hypotheses shrinks as the model gets better, and eventually we get a reality-predictor. But I think there are good reasons to believe that's far from guaranteed. I'd be curious to see what happens if you restrict the training data to unaltered camera footage rather than allowing anything fictitious, but the least optimistic possibility is that this kind of capability is necessary but not sufficient for adequate prediction (or, slightly more optimistically, that it only becomes sufficient at resolutions that are currently infeasible, or something like that).

One reason the less optimistic scenarios seem likely is that the kinds of extrapolation errors this model makes are similar in character to those of LLMs: extrapolation follows a gradient of smooth apparent transitions rather than any underlying logic about the objects portrayed, and the model sometimes seems to simply ignore situations far enough outside what it has seen rather than reconcile them.

For example, the tidal wave/historical hall example is a scenario unlikely to have been in the training data. Sure, there's the funny bit at the end where the surfer appears to levitate in the air, but there's a much larger issue with how these two contrasting scenes interact, or rather fail to. What we see looks a lot more like surfing footage superimposed, photoshop-style, onto a still image of the hall, as there's no evidence of the water interacting with the seats or walls of the hall at all. The model will roll with whatever you tell it to do as best it can, but it isn't doing anything like modeling "what would happen if" that implausible scenario played out; even doing that poorly would be a better sign that it's "simulating" the described scenario. Instead, we have impressive results for prompts that likely correspond closely to scenes the model has seen, and evidence of a lack of composition in cases where a particular composition is unlikely to have been seen and needs some underlying understanding, obvious to us, of how it "would" work.