RT-2: Vision-Language-Action Models (2023)

76 points by elsewhen 5 months ago

6 comments

YeGoblynQueenne 5 months ago
The problem with all the very impressive videos on that page is that we have no idea how many attempts were made before the robot could successfully, e.g., "put strawberry into the correct bowl". In that task there are four bowls, so a random choice of bowl would be correct 25% of the time. How many times did the robot put the strawberry in, say, the bowl of apples? And that's assuming "the correct bowl" is the one with the strawberries, which is a big assumption. The strawberry should go with the others of its kind: says who? How often can the robot put the strawberry in the bowl with the apples if that's what we want it to do?

Plotted results show around 50% average performance on "unseen" tasks, environments, objects, etc., which sounds a lot like success follows some kind of random distribution. That's not a great way to engender trust in the "emergent" abilities of a robotic system to generalise to unseen tasks. Blame bad statistics if you get a strawberry in the eye, or a banana in the ear.
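The 25% chance baseline in the comment above can be made concrete with a binomial tail probability. This is only a sketch: the function name and the trial counts are hypothetical, the page does not report how many attempts were made, and the roughly 50% figure is an average over many different tasks whose chance baselines are not all 25%.

```python
from math import comb


def chance_tail_probability(successes: int, trials: int, p_chance: float) -> float:
    """Probability of at least `successes` hits in `trials` independent
    attempts if each attempt succeeds purely by chance with `p_chance`."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# One successful video proves little: a random bowl choice succeeds
# a quarter of the time with four bowls.
p_single = chance_tail_probability(1, 1, 0.25)

# Hypothetical trial counts (not taken from the paper): the same 50%
# success rate is far more convincing with more attempts.
p_5_of_10 = chance_tail_probability(5, 10, 0.25)
p_50_of_100 = chance_tail_probability(50, 100, 0.25)

print(f"1 of 1 by chance:    {p_single:.4f}")
print(f"5 of 10 by chance:   {p_5_of_10:.4f}")
print(f"50 of 100 by chance: {p_50_of_100:.2e}")
```

The point is the one the comment makes: without the number of attempts, a success rate cannot be compared against chance at all.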
modeless 5 months ago
[2023]

Some of the authors have gone on to found a startup called Physical Intelligence: https://www.physicalintelligence.company/blog/pi0
mkagenius 5 months ago
> We represent the robot actions as text strings as shown below. An example of such a string could be a sequence of robot action token numbers: "1 128 91 241 5 101 127 217".

Training with numbers like this might be a little problematic. I have tried to fine-tune GPT 4o-mini this way with very little success (just me?).

On the other hand, I found [1] that Gemini and Molmo can locate elements on screen much better than 4o.

[1] https://github.com/BandarLabs/clickclickclick
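For readers wondering what a string like the quoted one might encode, here is a minimal sketch of mapping continuous action values to integer tokens and back. Everything beyond the quoted string is an assumption made for illustration: the 256 uniform bins, the [-1, 1] value range, and the function names are hypothetical and not confirmed by the quote.

```python
NUM_BINS = 256  # assumption: each token is an integer in 0..255


def encode_action(values, low=-1.0, high=1.0):
    """Clip each value to [low, high], map it to the nearest of NUM_BINS
    evenly spaced levels, and join the level indices with spaces."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        index = int((clipped - low) / (high - low) * (NUM_BINS - 1) + 0.5)
        tokens.append(str(index))
    return " ".join(tokens)


def decode_action(text, low=-1.0, high=1.0):
    """Inverse of encode_action: turn token indices back into values."""
    return [
        low + int(token) / (NUM_BINS - 1) * (high - low)
        for token in text.split()
    ]


quoted = "1 128 91 241 5 101 127 217"  # the string quoted in the comment
decoded = decode_action(quoted)
print([round(v, 3) for v in decoded])
print(encode_action(decoded) == quoted)
```

Under this scheme the model has to emit exact integers as text, which is the part the comment suspects is hard to fine-tune for.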
byyoung3 5 months ago
this is a year and a half old
GaggiX 5 months ago
(2023)
xnx 5 months ago
Impressive work. Connecting this with Nvidia's move to make robotics their next focus: is there a need to have powerful compute local to the robot? Cloud latency would seem to be fine for the speed of these robotic arms.