
Helix: A vision-language-action model for generalist humanoid control

303 points · by Philpax · 3 months ago

36 comments

porphyra · 3 months ago
It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?
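(Roughly, yes: the common vision-language-action (VLA) recipe patchifies camera frames into tokens, concatenates them with instruction tokens, runs a transformer, and decodes discretized action tokens. A minimal sketch of that idea; all shapes, vocabulary sizes, and layer counts below are illustrative, not Figure's architecture:)

```python
import torch
import torch.nn as nn

# Sketch of the generic VLA recipe: image patches + text tokens in,
# one discrete action bin per joint out. Illustrative only.
class TinyVLA(nn.Module):
    def __init__(self, d_model=256, n_action_bins=256, n_joints=7):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(32000, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, num_layers=4)
        # One discretized token (0..255) per joint, decoded to a target angle.
        self.action_head = nn.Linear(d_model, n_joints * n_action_bins)
        self.n_joints, self.n_bins = n_joints, n_action_bins

    def forward(self, image, text_ids):
        img_tok = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, D)
        txt_tok = self.text_embed(text_ids)                           # (B, T, D)
        h = self.backbone(torch.cat([img_tok, txt_tok], dim=1))
        logits = self.action_head(h[:, -1])                # summary token
        return logits.view(-1, self.n_joints, self.n_bins) # bin logits per joint

model = TinyVLA()
frame = torch.randn(1, 3, 224, 224)         # one camera frame
prompt = torch.randint(0, 32000, (1, 12))   # tokenized instruction
action_bins = model(frame, prompt).argmax(-1)  # (1, 7) discrete action bins
```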
yurimo · 3 months ago
I don't know, there have been so many overhyped and faked demos in the humanoid robotics space over the last couple of years that it's difficult to believe what is clearly a demo released for shareholders. I'd love to see a demonstration in a less controlled environment.
causal · 3 months ago
I'm always wondering about the safety measures on these things. How much force is in those motors?

This is basically safety-critical stuff, but with LLMs. Hallucinating wrong answers in text is bad; hallucinating that your chest is a drawer to pull open is very bad.
Symmetry · 3 months ago
So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work: which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than grasping, releasing, and maintaining the orientation of the object relative to the hand, and I'm curious how much it can do / they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one-handed Rubik's cube demo?
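(For intuition on that underactuation: tendon-driven hands couple many joints to fewer motors through a fixed coupling, so the reachable joint configurations form a low-dimensional subspace of joint space. A toy model with invented numbers, not Figure's spec:)

```python
import numpy as np

# Toy underactuated hand: more joints than actuators, joints move together
# through a fixed coupling matrix (as tendons impose). Numbers invented.
n_joints, n_actuators = 20, 6
rng = np.random.default_rng(0)
coupling = rng.uniform(0.2, 1.0, size=(n_joints, n_actuators))

actuator_pos = rng.uniform(0, 1, size=n_actuators)
joint_angles = coupling @ actuator_pos   # joints cannot be set independently

# Reachable configurations live in a 6-D subspace of 20-D joint space.
print(np.linalg.matrix_rank(coupling))   # -> 6
```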
wwwtyro · 3 months ago
Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.
ziofill · 3 months ago
There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.
plipt · 3 months ago
The demo is quite interesting, but I am mostly intrigued by the claim that it is running totally locally on each robot. It seems to use some agentic decision making, but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?

The article mentions that the system in each robot uses two AI models:

    S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data

and the other:

    S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.

It feels like, although the article is quite openly technical, they are leaving out the secret sauce. So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot.

What part of this system understands the 3-dimensional space of that kitchen?

How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

How is this kind of speech-to-text, visual identification, decision making, motor control, multi-robot coordination, and navigation of 3D space possible locally?

    Figure robots, each equipped with dual low-power-consumption embedded GPUs

Is anyone else skeptical? How much of this is possible vs. a staged tech demo to raise funding?
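(The split the article describes, a large VLM refreshing a semantic latent at low frequency and a small policy consuming it at high frequency, can be pictured as two loops sharing one vector. A purely illustrative sketch; the rates, dimensions, and callables are assumptions, not Figure's published code:)

```python
import threading
import time

import numpy as np

# Illustrative two-rate design: a big, slow VLM ("S2") refreshes a latent
# goal vector a few times per second, while a small, fast policy ("S1")
# reads the latest latent on every control tick.
LATENT_DIM = 512
latest_latent = np.zeros(LATENT_DIM)
lock = threading.Lock()

def s2_loop(vlm, get_image, get_prompt, hz=8):
    """Slow loop: run the large vision-language model a few times per second."""
    global latest_latent
    while True:
        z = vlm(get_image(), get_prompt())   # semantic latent, shape (LATENT_DIM,)
        with lock:
            latest_latent = z
        time.sleep(1.0 / hz)

def s1_loop(policy, get_proprio, send_motor_cmd, hz=200):
    """Fast loop: reactive visuomotor policy conditioned on the latest latent."""
    while True:
        with lock:
            z = latest_latent.copy()
        action = policy(get_proprio(), z)    # continuous joint targets
        send_motor_cmd(action)
        time.sleep(1.0 / hz)
```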
verytrivial · 3 months ago
Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which if added in post seems needlessly smoke-and-mirrors for what was a very impressive demonstration (of robots impersonating an extremely stoned human).
bilsbie · 3 months ago
This is amazing, but it also made me realize I just don't trust these videos. Is it sped up? How much is preprogrammed?

I know they claim there's no special coding, but did they practice this task? Special training?

Even if this video is totally legit, I'm burned out by all the hype videos in general.
aerodog · 3 months ago
Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/
pr337h4m · 3 months ago
Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/
sandis · 3 months ago
YouTube link for the video (for whatever reason the video hosted on their site kept buffering for me): https://www.youtube.com/watch?v=Z3yQHYNXPws
ge96 · 3 months ago
Wonder what their vision stack is like. Depth via sensors, or purely visual distance estimation of objects? And the inverse kinematics/proprioception? Anyway, it looks impressive.
sottol · 3 months ago
Imo, the Terminator movies would have been scarier if they moved like these guys: slow, careful, deliberate and measured, but unstoppable. There's something uncanny about this.
kla-s · 3 months ago
Does anyone know how long they have been at this? Is this mainly a reimplementation of the Physical Intelligence paper + the dual size/frequency setup + the cooperative part?
bhouston · 3 months ago
When doing robot control, how do you model the control of the robot? Do you have tool_use / function calling at the top-level model, which then gets turned into motion control parameters via inverse kinematic controllers?

What is the interface from the top level to the motors?

I feel it can't just be a neural network all the way down, right?
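(For context, the classical answer is a layered interface: a high-level policy emits end-effector targets, an inverse kinematics solver converts them to joint angles, and per-joint PD or torque controllers track those at high rate. A hedged sketch of that layering, with all names invented; end-to-end systems like Helix fold some of these layers into the network, but the bottom interface is still typically joint position or torque commands:)

```python
import numpy as np

# Classical layering beneath (or inside) a learned policy. Illustrative
# only, not Figure's stack.

def pd_torque(q_target, q, q_vel, kp=50.0, kd=2.0):
    """Per-joint PD controller: torque from position and velocity error."""
    return kp * (q_target - q) - kd * q_vel

def control_step(policy, ik_solve, state, camera_frame):
    ee_pose = policy(camera_frame, state)      # high level: desired end-effector pose
    q_target = ik_solve(ee_pose, state["q"])   # mid level: joint angles via IK
    tau = pd_torque(q_target, state["q"], state["q_vel"])  # low level: torques
    return tau                                 # sent to the motor drivers
```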
andiareso · 3 months ago
Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn?

Stop hosting your videos as MP4s on your web server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high-resolution MP4s.

/rant
traverseda · 3 months ago
"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?
swalsh · 3 months ago
At this point, this is enough autonomy to have a set of these guys man a howitzer (read: old stockpiles of weapons we already have). Kind of a scary thought. On one hand, I think the idea of moving real people out of danger in war is a good one, and as an American I'd want Americans to have an edge... and we can't guarantee our enemies won't take it if we skip it. On the other hand, I have a visceral reaction to machines killing people.

I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.
ramenlover · 3 months ago
Why do they make “eye contact” after every hand off? Feels oddly forced.
bbor · 3 months ago
To focus on something other than the obviously terrifying nature of this and the skepticism it rightfully entails on our part:

    A fast reactive visuomotor policy that translates the latent semantic representations produced by S2 into precise continuous robot actions at 200 Hz

Why 200 Hz...? Any robotics experts in here? Because to this layman that seems like a really high rate at which to update motor controls.
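(For scale: 200 Hz is one action every 5 ms, which is fast for a neural policy but unremarkable in robot control, where joint-level torque loops commonly run at 1 kHz or more. The arithmetic:)

```python
# 200 Hz = one new action every 5 ms. Fast for a neural net, ordinary for
# robot control (joint-level torque loops commonly run at 1 kHz or more).
policy_hz = 200
period_ms = 1000 / policy_hz                    # 5.0 ms budget per action
print(f"{period_ms:.1f} ms per S1 inference")   # the 80M-param model must fit here
```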
Animats · 3 months ago
"Pick up anything: Figure robots equipped with Helix can now pick up virtually any small household object, including thousands of items they have never encountered before, simply by following natural language prompts."

If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?
bilsbie · 3 months ago
I get the impression there's a language model sending high-level commands to a control model? I wonder when we can have one multimodal model that controls everything.

The latest models seem to be fluidly tied in with generating voice, even singing and laughing.

It seems like it would be possible to train a multimodal model that can do that with low-level actuator commands.
ripped_britches · 3 months ago
This whole thread is just people who didn't read the technical details or who immediately doubt the video's honesty.

I'm actually fairly impressed with this, because it's one neural net (which is the goal), and the two-system paradigm is really cool. I don't know much about robotics, but this seems like the right direction.
ianamo · 3 months ago
Are we at a point now where Asimov’s laws are programmed into these fellas somewhere?
exe34 · 3 months ago
Is there a paper? I think I get how they did their training, but I'd like to understand it more.

Does anyone know if this trained model would work on a different robot at all, or would it need retraining?
the_other · 3 months ago
It's funny... there are a lot of comments here asking "why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself?"

That's how I feel about LLMs and code.
kingkulk · 3 months ago
Anyone have a link to their paper?
IAmNotACellist · 3 months ago
I don't suppose this is open research and I can read about their model architecture?
ein0p · 3 months ago
There's no way this is 100% real, though. No startup demo ever is.
bilsbie · 3 months ago
They should have made them talk. It’s a little dehumanizing otherwise.
butifnot0701 · 3 months ago
It's kinda eerie how they look at each other after handover
anentropic · 3 months ago
Very impressive.

Why make such sinister-looking robots, though...?
kubb · 3 months ago
Wow! This is something new.
dr_dshiv · 3 months ago
Wake me when robots can make a peanut butter sandwich
abraxas · 3 months ago
Is this even reality or CGI? They really should show these things off in less sterile environments, because this video has a very CGI feel to it.