
Magma: A foundation model for multimodal AI agents

305 points by SerCe, 3 months ago

14 comments

jwyang, 3 months ago
Thanks for the great interest in our Magma work, everyone!

We will gradually roll out the inference/training/evaluation/data-preprocessing code in our codebase: https://github.com/microsoft/Magma, and this will be finished by next Tuesday. Stay tuned!
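For readers who want to try the model once that code lands, here is a minimal loading sketch using Hugging Face transformers. The checkpoint ID microsoft/Magma-8B, the prompt format, and the processor interface are assumptions not confirmed in this thread; consult the linked repo for the real interface.

```python
# Hypothetical usage sketch -- checkpoint ID and prompt format are
# assumptions; see https://github.com/microsoft/Magma for the real API.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "microsoft/Magma-8B"  # assumed Hugging Face model ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("screenshot.png")
prompt = "<image>\nWhat is the next UI action to check the weather?"
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```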
ygouzerh, 3 months ago
The rate of progress on multimodal agents is impressive. OpenVLA was released in June 2024 and was state of the art at the time... 8 months later, on tasks like "Pick Place Hotdog Sausage", the success rate has gone from 2/10 to 6/10.
erikig, 3 months ago
The multimodal capabilities, especially on next-action prediction, are quite impressive; watching the GitHub repo to see if & when they'll open-source this: https://github.com/microsoft/Magma

Also, I wonder why they named it Magma?
Oras, 3 months ago
Looking at industrial robots, they don't mimic how humans do things, and hence they are efficient. That's why I don't understand how these proposals to teach robots to do things the way humans do make any sense.

To have robots at home, their tools will need to be efficient. It will not be the same washing machine, oven, or dishwasher that we use now; there will be new ones made for robots.
sorz, 3 months ago
In the mug-scrubbing video, the person clearly *pretends* to wash the cup but does not seem to want to get their hands wet. I'm curious when models will be able to figure out that kind of subtle thing.
lelag, 3 months ago
Really interesting model; I'm looking forward to playing with it.

But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta Motivo [0], rather than directly outputting coordinates.

Meta Motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers; that limits its capabilities beyond having some fun with it. They could have used a more advanced base model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.

Most existing motion datasets come from academic motion-capture setups, which are complex, not focused on manipulation tasks, and also pretty old. I believe this gap will be filled by improvements in 3D HPE (human pose estimation) from 2D video. With access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions.

This will enable training the two components needed for dexterous humanoid robots: the agentic model that decides what actions to take and generates embeddings, and a control model that reads those embeddings and accurately models hand and finger joint movement (a sketch of this split follows below).

Given the rapid progress in SoTA 3D HPE from 2D video, and the vast amount of video online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.

[0]: https://github.com/facebookresearch/metamotivo
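Editorial sketch of the two-stage split this comment proposes: a slow multimodal agent that emits a latent goal embedding, and a fast low-level controller that consumes it. Every class, dimension, and method name below is a hypothetical illustration, not Magma's or Meta Motivo's actual API.

```python
import numpy as np

EMBED_DIM = 256   # assumed size of the shared goal-embedding space
NUM_JOINTS = 52   # e.g. an SMPL-X-like skeleton with articulated fingers

class AgentModel:
    """High-level multimodal agent (hypothetical): perceives the scene
    and decides *what* to do, expressed as a goal embedding."""
    def plan(self, rgb_frame: np.ndarray, instruction: str) -> np.ndarray:
        # A real model would run vision-language inference here.
        return np.zeros(EMBED_DIM, dtype=np.float32)

class ControlModel:
    """Low-level humanoid controller (hypothetical): turns a goal
    embedding plus proprioception into joint-angle targets at high rate."""
    def act(self, goal: np.ndarray, joint_state: np.ndarray) -> np.ndarray:
        # A real controller would be a learned policy, e.g. trained on
        # large-scale motion data recovered from 2D video.
        return np.zeros(NUM_JOINTS, dtype=np.float32)

# Control loop: the agent replans slowly; the controller runs fast.
agent, controller = AgentModel(), ControlModel()
goal = agent.plan(np.zeros((480, 640, 3)), instruction="wash the mug")
joint_state = np.zeros(NUM_JOINTS, dtype=np.float32)
for _ in range(100):  # e.g. 100 control ticks per high-level plan
    joint_state = controller.act(goal, joint_state)
```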
bilsbie, 3 months ago
Why do no multimodal models fluidly create images? It seems like they pass off to another model to generate images. They're not really aware of what's in the images they make, and they can't edit images in place.
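For context, a toy illustration of the hand-off pattern the comment describes: the language model only emits a text prompt, a separate image model produces pixels, and nothing flows back, so the language model can neither inspect nor edit the result. All names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GeneratedImage:
    pixels: bytes  # opaque to the language model

def language_model(user_request: str) -> str:
    """Hypothetical LLM: turns the user's request into a text prompt
    for the image model. This string is all the LLM ever 'sees'."""
    return f"A photo of {user_request}, high detail"

def image_model(prompt: str) -> GeneratedImage:
    """Hypothetical diffusion model: generates pixels from text."""
    return GeneratedImage(pixels=b"...")

# The hand-off: text goes one way, pixels come out the other side,
# and no signal returns to the LLM about what was actually drawn.
prompt = language_model("a cat washing a mug")
picture = image_model(prompt)
```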
yurimo, 3 months ago
Multimodal agents notoriously fail at long-horizon tasks; how does Magma perform on those?
kittikitti, 3 months ago
These benchmarks are not really representative of what agents are capable of. The slow process of checking the weather through UI elements, which this non-peer-reviewed paper showcases, is not a good use case.
Mizza, 3 months ago
Have any multimodal models been reasoning-trained yet?
funnyAI, 3 months ago
Just wondering if there is any research into incremental training? It could be used in robots as an alternative to RAG.
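One concrete form of incremental training is parameter-efficient fine-tuning, where new experience is folded into small adapter weights rather than stored in a retrieval index. A minimal sketch with the peft library's LoRA adapters; the base model and training data are illustrative assumptions.

```python
# Sketch: fold new experience into the model's weights with a small
# LoRA adapter instead of storing it in a retrieval index (RAG).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-125m"  # small stand-in for a robot's language model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Only the low-rank adapter weights are trained; the base stays frozen,
# which keeps each incremental update cheap.
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)

new_experience = ["the dishwasher door latch sticks; push twice"]
batch = tokenizer(new_experience, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(3):  # a few gradient steps per new batch of experience
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
model.save_pretrained("adapter-after-update")  # swap in the updated adapter
```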
bob_theslob646, 3 months ago
Am I the only one who read that title in Dr. Evil's voice?

All kidding aside, this looks promising.
digitaltrees, 3 months ago
They need to build an epistemology and theory-of-mind engine into these models. We take it for granted when dealing with other humans that they can infer deep meaning, motivations, and expectations of truth vs. fiction. These agents don't do that, and they will be awful collaborators until those behaviors are present.
bosky101, 3 months ago
Spent 10 minutes on the website; all the examples are single-agent examples. There is zero value added in yet another wrapper on an OpenAI call parading as an agent.

The whole point of agents is knowing what to do among potentially hundreds of intents and actions (a toy sketch of that routing problem follows below).

Disappointing.
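A toy sketch of the routing problem raised here: dispatching one request across a large catalog of intents and actions instead of wrapping a single model call. The registry and keyword scoring are hypothetical simplifications; a real router would use a classifier or embedding retrieval over the full intent catalog.

```python
from typing import Callable

# Hypothetical action registry -- a real system might have hundreds.
ACTIONS: dict[str, Callable[[str], str]] = {
    "weather.lookup": lambda q: f"checking weather for {q!r}",
    "calendar.create": lambda q: f"creating event from {q!r}",
    "email.draft": lambda q: f"drafting email about {q!r}",
}

def score(intent: str, request: str) -> float:
    """Toy relevance score based on keyword overlap. A real router
    would use a learned classifier or embedding similarity."""
    return sum(word in request.lower() for word in intent.split("."))

def route(request: str) -> str:
    # Pick the highest-scoring intent and dispatch to its action.
    intent = max(ACTIONS, key=lambda name: score(name, request))
    return ACTIONS[intent](request)

print(route("what's the weather in Tokyo"))
```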