It seems like these models could perform much better if the modeling were deliberately decomposed into stages, with less emphasis on doing everything in one parallel step.

For example, translating the visual input into a 3D model first.
Or maybe some neural representation that can generate the 3D models, then train the movement policy on that representation rather than on raw pixels.

Similarly, for textual prompts describing interactions, first build a model that relates the word embeddings to the same 3D models and physics interactions.

Obviously much easier said than done.
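To make the decomposition concrete, here is a minimal sketch of the idea as a three-stage pipeline: pixels and text are each mapped into a shared "scene state", and the movement policy only ever sees that state, never raw pixels. Everything here is assumed for illustration, including the module names, the dimensions, and the use of a latent vector as a stand-in for an explicit 3D model; it is not how any particular system actually works.

```python
# Hypothetical sketch of the staged decomposition described above (PyTorch).
import torch
import torch.nn as nn


class VisionTo3D(nn.Module):
    """Stage 1: encode an image into a latent 'scene state'
    (a stand-in for an explicit 3D model)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, latent_dim)

    def forward(self, pixels):          # pixels: (B, 3, H, W)
        return self.head(self.conv(pixels))


class TextTo3D(nn.Module):
    """Stage 2: map a prompt embedding into the same scene-state space."""
    def __init__(self, embed_dim=300, latent_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, word_embedding):  # word_embedding: (B, embed_dim)
        return self.proj(word_embedding)


class MovementPolicy(nn.Module):
    """Stage 3: the policy is trained on scene states, not on pixels."""
    def __init__(self, latent_dim=128, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, scene_state, goal_state):
        return self.net(torch.cat([scene_state, goal_state], dim=-1))


if __name__ == "__main__":
    vision, text, policy = VisionTo3D(), TextTo3D(), MovementPolicy()
    pixels = torch.randn(1, 3, 64, 64)   # placeholder camera frame
    prompt = torch.randn(1, 300)         # placeholder word embedding
    scene = vision(pixels)               # pixels -> scene state
    goal = text(prompt)                  # text -> same scene-state space
    action = policy(scene, goal)         # policy sees only the decomposed state
    print(action.shape)                  # torch.Size([1, 7])
```

The point of the split is that each stage could then be trained (and debugged) on its own objective, instead of forcing one end-to-end model to learn perception, grounding, and control in a single step.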