Really interesting model, I'm looking forward to playing with it.<p>But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta Motivo[0], rather than directly outputting coordinates.<p>Meta Motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers; that limits its usefulness beyond having some fun with it. They could have used a more advanced base model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.<p>Most existing motion datasets come from academic motion capture setups, which are complex, not focused on manipulation tasks, and also fairly old. I believe this gap will be filled by improvements in 3D human pose estimation (HPE) from 2D video: with access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions.<p>That would enable training the two components needed for dexterous humanoid robots: an agentic model that decides what actions to take and emits embeddings, and a control model that reads those embeddings and accurately drives hand and finger joints (a rough sketch of that split is below).<p>Given the rapid progress in SoTA 3D HPE from 2D video, and the vast amount of video online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.<p>[0]: <a href="https://github.com/facebookresearch/metamotivo">https://github.com/facebookresearch/metamotivo</a>
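<p>For what it's worth, here's a minimal Python sketch of that two-component split. AgenticPlanner, ControlPolicy, and all dimensions are made-up placeholders to illustrate the interface, not Meta Motivo's actual API (see its repo for that):

    import numpy as np

    class AgenticPlanner:
        """High-level model: maps a task description to a latent skill
        embedding z instead of emitting joint coordinates directly."""

        def __init__(self, z_dim: int = 256):
            self.z_dim = z_dim

        def plan(self, instruction: str) -> np.ndarray:
            # Placeholder: a real model would condition on vision and language.
            rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
            z = rng.standard_normal(self.z_dim)
            return z / np.linalg.norm(z)  # unit-norm latent

    class ControlPolicy:
        """Low-level controller: reads the embedding plus proprioceptive state
        and outputs joint-level actions (fingers included, given SMPL-X-like
        training data)."""

        def __init__(self, z_dim: int = 256, obs_dim: int = 300, act_dim: int = 60):
            rng = np.random.default_rng(0)
            # A linear policy stands in for whatever network the control model uses.
            self.w = rng.standard_normal((act_dim, obs_dim + z_dim)) * 0.01

        def act(self, obs: np.ndarray, z: np.ndarray) -> np.ndarray:
            return np.tanh(self.w @ np.concatenate([obs, z]))

    planner = AgenticPlanner()
    policy = ControlPolicy()
    z = planner.plan("pick up the mug by its handle")
    obs = np.zeros(300)          # proprioceptive observation from the robot/sim
    action = policy.act(obs, z)  # joint commands for the humanoid

The point is just the interface: the planner never touches joint angles, it only hands the controller a latent z, which is what I'd like a multimodal agent model to produce.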