Really interesting model, I'm looking forward to playing with it.<p>But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta Motivo[0], rather than directly outputting coordinates.<p>Meta Motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers; that limits its usefulness beyond having some fun with it. They could have used a more advanced base model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.<p>Most existing motion datasets come from academic motion capture setups, which are complex, not focused on manipulation tasks, and also fairly old. I believe this gap will be filled by improvements in 3D human pose estimation (HPE) from 2D video: with access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions.<p>That would enable training the two components needed for dexterous humanoid robots: an agentic model that decides what actions to take and emits embeddings, and a control model that reads those embeddings and accurately drives hand and finger joints (a rough sketch of that split is below).<p>Given the rapid progress in SoTA 3D HPE from 2D video, and the vast amount of video online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.<p>[0]: <a href="https://github.com/facebookresearch/metamotivo">https://github.com/facebookresearch/metamotivo</a>
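<p>For what it's worth, here's a minimal Python sketch of that two-component split. AgenticPlanner, ControlPolicy, and all dimensions are made-up placeholders to illustrate the interface, not Meta Motivo's actual API (see its repo for that):

    import numpy as np

    class AgenticPlanner:
        """High-level model: maps a task description to a latent skill
        embedding z instead of emitting joint coordinates directly."""

        def __init__(self, z_dim: int = 256):
            self.z_dim = z_dim

        def plan(self, instruction: str) -> np.ndarray:
            # Placeholder: a real model would condition on vision and language.
            rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
            z = rng.standard_normal(self.z_dim)
            return z / np.linalg.norm(z)  # unit-norm latent

    class ControlPolicy:
        """Low-level controller: reads the embedding plus proprioceptive state
        and outputs joint-level actions (fingers included, given SMPL-X-like
        training data)."""

        def __init__(self, z_dim: int = 256, obs_dim: int = 300, act_dim: int = 60):
            rng = np.random.default_rng(0)
            # A linear policy stands in for whatever network the control model uses.
            self.w = rng.standard_normal((act_dim, obs_dim + z_dim)) * 0.01

        def act(self, obs: np.ndarray, z: np.ndarray) -> np.ndarray:
            return np.tanh(self.w @ np.concatenate([obs, z]))

    planner = AgenticPlanner()
    policy = ControlPolicy()
    z = planner.plan("pick up the mug by its handle")
    obs = np.zeros(300)          # proprioceptive observation from the robot/sim
    action = policy.act(obs, z)  # joint commands for the humanoid

The point is just the interface: the planner never touches joint angles, it only hands the controller a latent z, which is what I'd like a multimodal agent model to produce.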