Whenever I see these text-to-video diffusion models, I always think a much better result would come from some sort of multi-modal, vision-aware AI agent that could use 3D modeling tools like Blender, continuously iterate on the scene, work with physics and particle simulation tools, and then refine the less realistic details with a diffusion-model-based filter. It would also allow all the assets created along the way to be reused and refined.

I imagine this will be the de facto way AI generates video in the next few years, once huge models are smart enough to use tools and work on jobs with the breadth and depth that humans do.
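To make the idea concrete, here's a rough sketch of what one pass of that loop could look like, assuming Blender's bpy scripting API for the physics/render half and a stock img2img pipeline from Hugging Face diffusers as the "realism filter" half. The model name, prompt, and strength value are illustrative placeholders, and there's no agent here, just the two stages the agent would drive:

```python
# Stage 1 (runs inside Blender): drop a rigid-body cube onto a ground
# plane and render each frame of the physics simulation.
import bpy

def render_sim_frames(out_dir: str, num_frames: int = 48) -> list[str]:
    scene = bpy.context.scene
    # Passive collider: the ground plane.
    bpy.ops.mesh.primitive_plane_add(size=10)
    bpy.ops.rigidbody.object_add(type='PASSIVE')
    # Active body: a cube that falls under the simulation.
    bpy.ops.mesh.primitive_cube_add(location=(0, 0, 5))
    bpy.ops.rigidbody.object_add(type='ACTIVE')

    paths = []
    for f in range(1, num_frames + 1):
        scene.frame_set(f)  # advance the sim to this frame
        path = f"{out_dir}/frame_{f:04d}.png"
        scene.render.filepath = path
        bpy.ops.render.render(write_still=True)
        paths.append(path)
    return paths

# Stage 2 (typically a separate process, since bpy only runs inside
# Blender): push each raw render through low-strength img2img so the
# simulated geometry and motion survive while surface detail gets
# repainted by the diffusion model.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def refine_frame(frame_path: str, prompt: str) -> Image.Image:
    raw = Image.open(frame_path).convert("RGB")
    # strength ~0.3: enough to add realism, low enough to keep structure.
    return pipe(prompt=prompt, image=raw, strength=0.3).images[0]
```

The agent part would be a loop around these two stages: render, critique the frames with a vision model, tweak the scene, and re-render, keeping every asset along the way.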